Berserk Docs

Semantic Search (similarto)

Rank logs and spans by meaning with `top K by <field> similarto "<text>"` — how it works, when to use it, and its limits.

similarto ranks rows by how close a field's text is, in meaning, to a natural-language phrase — so you can find events that look like a description without knowing the exact words, error codes, or templates in your data.

default
| where timestamp > ago(1h)
| top 100 by body similarto 'database connection timeouts'

This returns the 100 closest-matching rows, each with a _score column (a real in [0, 1], higher = more similar), sorted from most to least similar.

Syntax

There is exactly one shape:

<source> | top K by <field> similarto "<text>"
  • K — how many rows to return (the top limit), sorted by _score descending.
  • <field> — a single string column or dotted path (e.g. body, attributes.exception.message). A dot means a nested path: attributes.prompt descends the attributes bag to its prompt key. To target a field whose name literally contains a dot, bracket-escape that segment: attributes['db.statement'] reads the db.statement key inside attributes, and ['http.target'] is a single top-level field named http.target. The field must be one your data is indexed on (see Requirements).
  • "<text>" — the search phrase, a string literal. Both '…' and "…" work.

similarto adds a _score: real column to the output; everything else passes through unchanged. _score is an ordinary column — project it, filter on it, or re-rank by it like any other.

Examples

// Narrow with a fast filter first, then rank semantically
default
| where Service == 'checkout-api' and timestamp > ago(6h)
| top 50 by body similarto 'payment provider returned 5xx'
| project timestamp, _score, severity_text, body

// Rank a nested attribute instead of the log body
default
| top 50 by attributes.exception.message similarto 'null pointer in serializer'

// Rank wide, then post-filter on a domain signal and keep the 50 best scores
default
| top 1000 by body similarto 'rate limited by upstream'
| where severity_number >= 17
| top 50 by _score

Put the filters meant to narrow the search before the top … similarto: they shrink the candidate set the ranking runs over (and let Berserk skip data). Filters placed after it apply only to the rows already returned (see Limitations).

How scoring works (and what that means for you)

Every row is scored, but the similarity is measured against the row's template, not its raw text. Berserk reduces each row to a template with the same extract_log_template function used elsewhere in KQL — stripping the variable parts (ids, numbers, paths, timestamps) down to a skeleton — and compares your search phrase against those skeletons. Matching is therefore per-row, but the signal is template-level. The practical consequences:

  • Rows that share a template share a score. "Found 10 products" and "Found 20 products" reduce to the same template, so they rank identically — the ranking is over kinds of events, not individual lines.
  • Paraphrases of the same kind of event rank together. "auth refused", "credentials rejected", and "401 from idp" land near each other even though they share no words, because their templates embed close together.
  • It's a similarity ranking, not a filter. top K always returns up to K rows; low-relevance rows still appear (with low _score) once the closer ones run out. Use a where _score > … after it if you want a floor.
  • It's strongest for topics, weakest for one-off lines. Finding "events like database timeouts" works well; finding a single rare line whose template appears nowhere else has little for the ranking to grip on.
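
The first bullet suggests reading results as kinds of events rather than individual lines. A minimal sketch, assuming extract_log_template accepts the field as its only argument and that summarize, count() and max() behave as in standard KQL (this page confirms none of these); the search phrase and the column names template, row_count and score are illustrative:

```kql
default
| where timestamp > ago(1h)
| top 1000 by body similarto 'database connection timeouts'
// Assumption: single-argument call; check the function's own reference.
| extend template = extract_log_template(body)
// Rows sharing a template share a score, so max(_score) is that shared score.
| summarize row_count = count(), score = max(_score) by template
| top 20 by score
```

The grouping happens after the ranking, so similarto still sits directly over the table as the Limitations section requires.
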

| You want… | Use |
| --- | --- |
| Events on a topic, by description ("auth failures") | top K by body similarto '…' |
| An exact token / known phrase you remember | where body has '…' (faster, exact) |
| A regex / structured match | where body matches regex '…' |

For exact strings, has is both faster and more precise — similarto is for when you don't know the exact wording.
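
When one token is known but the rest of the wording is not, the two approaches can be combined. A sketch, assuming has can serve as a pre-filter in the same way as the where clauses in the Examples section; the token and the phrase are made up for illustration:

```kql
default
| where timestamp > ago(6h)
// Exact, fast pre-filter on the token you do remember
| where body has 'ECONNRESET'
// Semantic ranking over whatever is left
| top 50 by body similarto 'upstream closed the connection during checkout'
```

Because the has filter runs before the ranking, it narrows the candidate set instead of trimming the rows already returned.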

Requirements

similarto only produces meaningful scores when the field is indexed for semantic search for that dataset (an admin enables it per-field at ingest). When it isn't:

  • The query still runs and returns rows, but _score is 0 for rows with no indexed template — there's no semantic signal to rank on.
  • A field name that doesn't exist in the schema is a bind-time error (unknown column), same as anywhere else.

If your scores come back all-zero, the field is most likely not configured for semantic search on that dataset — ask whoever administers it.
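
One way to check for this is to count how many of the returned rows scored zero. A sketch, assuming summarize, count() and countif() are available as in standard KQL (this page does not confirm them); the phrase and the column names returned and unscored are illustrative:

```kql
default
| where timestamp > ago(1h)
| top 1000 by body similarto 'database connection timeouts'
// If unscored is close to returned, the field is probably not indexed here
| summarize returned = count(), unscored = countif(_score == 0)
```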

Limitations

This is the first release of the feature. Current limits:

  • One shape only. similarto is valid only as the sole sort key of a top … by directly over a table. It is not yet usable inside where, extend, or project, as a second sort key, with asc order, or after join / fork / summarize. Those forms return a clear bind-time error.
  • One similarto per query.
  • The phrase must be a literal string — you can't pass a column or variable as the search text.
  • Single field. Searching several fields at once ((body, attributes…)) is not supported.
  • Descending only. top K by … similarto … returns the most similar K. "Least similar" (asc) is rejected.
  • No score threshold built in. There's no min_score; filter on _score after the top if you need a floor (note this filters after the top-K, so it can return fewer than K rows).
  • parse / mv-expand before it disable semantic scoring. If an upstream operator rewrites the field's value (e.g. parse body … or mv-expand), Berserk can't trust the indexed template for that field and scores fall back (often to 0). Rank before you reshape the field.
  • Downstream filters don't speed up the search. A where after the top … similarto filters the rows already chosen; only filters before it narrow what gets ranked. Prefer where … | top K by … similarto ….
  • English-centric by default. The default embedding model is English-trained; quality on other languages is poor until a multilingual model is configured.
  • Topic search, not anomaly detection. similarto finds rows like a phrase; it doesn't surface "unusual" events.
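
Several of these limits come down to operator order. A sketch of a query that stays within them, assuming parse … with follows standard KQL syntax; the 0.3 floor, the phrase, and the user column are illustrative, not recommended values:

```kql
default
// Filters that should narrow the search go first
| where timestamp > ago(1h)
// Rank while body is still the untouched, indexed value
| top 200 by body similarto 'login rejected for user'
// Score floor applied after the top-K, so fewer than 200 rows may remain
| where _score > 0.3
// Reshape only after ranking (assumed standard KQL parse syntax)
| parse body with * 'user=' user ' ' *
| project timestamp, _score, user, body
```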
