Semantic Search (similarto)
Rank logs and spans by meaning with `top K by <field> similarto "<text>"` — how it works, when to use it, and its limits.
similarto ranks rows by how close a field's text is, in meaning, to a
natural-language phrase — so you can find events that look like a description
without knowing the exact words, error codes, or templates in your data.
default
| where timestamp > ago(1h)
| top 100 by body similarto 'database connection timeouts'

This returns up to 100 closest-matching rows, each with a _score column (a
real in [0, 1], higher = more similar), sorted from most to least similar.
Syntax
There is exactly one shape:
<source> | top K by <field> similarto "<text>"

- `K`: how many rows to return (the `top` limit), sorted by `_score` descending.
- `<field>`: a single string column or dotted path (e.g. `body`, `attributes.exception.message`). A dot means a nested path: `attributes.prompt` descends the `attributes` bag to its `prompt` key. To target a field whose name literally contains a dot, bracket-escape that segment: `attributes['db.statement']` reads the `db.statement` key inside `attributes`, and `['http.target']` is a single top-level field named `http.target`. The field must be one your data is indexed on (see Requirements).
- `"<text>"`: the search phrase, a string literal. Both `'…'` and `"…"` work.
similarto adds a _score: real column to the output; everything else passes
through unchanged. _score is an ordinary column — project it, filter on it,
or re-rank by it like any other.
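As a sketch of the bracket-escape and `_score` handling described above, the query below ranks on a dotted attribute key and then projects the score next to it. It assumes `attributes['db.statement']` is indexed for semantic search on this dataset; the search phrase and the `statement` alias are illustrative only.

```kql
// Assumes attributes['db.statement'] is indexed for semantic search.
default
| where timestamp > ago(1h)
| top 20 by attributes['db.statement'] similarto 'slow query scanning the orders table'
// _score is an ordinary column, so it can be projected like any other.
| project timestamp, _score, statement = attributes['db.statement']
```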
Examples
// Narrow with a fast filter first, then rank semantically
default
| where Service == 'checkout-api' and timestamp > ago(6h)
| top 50 by body similarto 'payment provider returned 5xx'
| project timestamp, _score, severity_text, body
// Rank a nested attribute instead of the log body
default
| top 50 by attributes.exception.message similarto 'null pointer in serializer'
// Rank wide, then post-filter and re-rank by a domain signal
default
| top 1000 by body similarto 'rate limited by upstream'
| where severity_number >= 17
| top 50 by _score

Put filters you want to narrow the search before the `top … similarto`;
they shrink the candidate set the ranking runs over (and let Berserk skip data).
Filters placed after it apply to the rows already returned (see
Limitations).
How scoring works (and what that means for you)
Every row is scored, but the similarity is measured against the row's
template, not its raw text. Berserk reduces each row to a template with the
same extract_log_template function used elsewhere in KQL — stripping the
variable parts (ids, numbers, paths, timestamps) down to a skeleton — and
compares your search phrase against those skeletons. Matching is therefore
per-row, but the signal is template-level. The practical consequences:
- Rows that share a template share a score. "Found 10 products" and "Found 20 products" reduce to the same template, so they rank identically — the ranking is over kinds of events, not individual lines.
- Paraphrases of the same kind of event rank together. "auth refused", "credentials rejected", and "401 from idp" land near each other even though they share no words, because their templates embed close together.
- It's a similarity ranking, not a filter. `top K` always returns up to `K` rows; low-relevance rows still appear (with low `_score`) once the closer ones run out. Use a `where _score > …` after it if you want a floor.
- It's strongest for topics, weakest for one-off lines. Finding "events like database timeouts" works well; finding a single rare line whose template appears nowhere else has little for the ranking to grip on.
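Because `top K` pads the result with weaker matches once the close ones run out, a floor on `_score` is one way to keep only the closer rows. A minimal sketch follows; the `0.5` cutoff is an arbitrary placeholder, not a recommended value, and a sensible threshold depends on your data.

```kql
default
| where timestamp > ago(1h)
| top 200 by body similarto 'credentials rejected by identity provider'
// 0.5 is a placeholder floor; this step can leave fewer than 200 rows.
| where _score > 0.5
| project timestamp, _score, body
```

Since rows that share a template share a score, expect the surviving rows to arrive in groups with identical `_score` values.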
When to use similarto vs. lexical search
| You want… | Use |
|---|---|
| Events on a topic, by description ("auth failures") | top K by body similarto '…' |
| An exact token / known phrase you remember | where body has '…' (faster, exact) |
| A regex / structured match | where body matches regex '…' |
For exact strings, has is both faster and
more precise — similarto is for when you don't know the exact wording.
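To make the contrast concrete, here is the same investigation phrased both ways. The token `ECONNREFUSED` and the search phrase are made-up examples, and the two queries are independent: run one or the other.

```kql
// Lexical: you know the exact token that appears in the log line.
default
| where timestamp > ago(1h)
| where body has 'ECONNREFUSED'

// Semantic: you only know what the event is about.
default
| where timestamp > ago(1h)
| top 100 by body similarto 'connection refused by downstream service'
```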
Requirements
similarto only produces meaningful scores when the field is indexed for
semantic search for that dataset (an admin enables it per-field at ingest).
When it isn't:
- The query still runs and returns rows, but `_score` is `0` for rows with no indexed template; there's no semantic signal to rank on.
- A field name that doesn't exist in the schema is a bind-time error (`unknown column`), same as anywhere else.
If your scores come back all-zero, the field is most likely not configured for semantic search on that dataset — ask whoever administers it.
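One quick way to check for this, sketched with only the operators shown on this page: rank on the field, then keep rows with a non-zero score. The phrase is arbitrary; pick one you expect to match real events in the chosen time window.

```kql
default
| where timestamp > ago(1h)
| top 100 by body similarto 'database connection timeouts'
// If the time window has data but nothing survives this filter,
// every ranked row had _score == 0.
| where _score > 0
| project timestamp, _score, body
```

An empty result can also mean the time window simply contains no rows, so confirm that the unfiltered query returns data before concluding that the field is not indexed.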
Limitations
This is the first release of the feature. Current limits:
- One shape only. `similarto` is valid only as the sole sort key of a `top … by` directly over a table (optionally narrowed by preceding `where` filters). It is not yet usable inside `where`, `extend`, `project`, a second sort key, `asc` order, or after `join` / `fork` / `summarize`. Those forms return a clear bind-time error.
- One `similarto` per query.
- The phrase must be a literal string: you can't pass a column or variable as the search text.
- Single field. Searching several fields at once (`(body, attributes…)`) is not supported.
- Descending only. `top K by … similarto …` returns the most similar K. "Least similar" (`asc`) is rejected.
- No score threshold built in. There's no `min_score`; filter on `_score` after the `top` if you need a floor (note this filters after the top-K, so it can return fewer than K rows).
- `parse` / `mv-expand` before it disable semantic scoring. If an upstream operator rewrites the field's value (e.g. `parse body …` or `mv-expand`), Berserk can't trust the indexed template for that field and scores fall back (often to `0`). Rank before you reshape the field.
- Downstream filters don't speed up the search. A `where` after the `top … similarto` filters the rows already chosen; only filters before it narrow what gets ranked. Prefer `where … | top K by … similarto …`.
- English-centric by default. The default embedding model is English-trained; quality on other languages is poor until a multilingual model is configured.
- Topic search, not anomaly detection. `similarto` finds rows like a phrase; it doesn't surface "unusual" events.
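A sketch of a pipeline that stays within these limits: cheap filters first, the single `top … similarto` next, and any reshaping of the ranked field only afterwards. The `parse` pattern and the `order_id` column are hypothetical, and the example assumes Berserk accepts the standard KQL `parse … with` form.

```kql
default
// 1. Narrow before ranking: these filters shrink the candidate set.
| where timestamp > ago(6h) and severity_number >= 17
// 2. Rank on the untouched field (one similarto, sole sort key, literal phrase).
| top 100 by body similarto 'payment provider returned 5xx'
// 3. Reshape only after ranking (hypothetical pattern and column name).
| parse body with * 'order_id=' order_id ' ' *
| project timestamp, _score, order_id, body
```

Moving step 3 above step 2 would rewrite `body` before it is ranked, which is the case the `parse` / `mv-expand` limit describes.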
See also
- Best Practices: Berserk's indexing and filter-pushdown model.
- `top` operator: the host operator for `similarto`.