Best Practices

Writing fast Berserk queries — and where Berserk's performance model differs from Azure Data Explorer

Most Microsoft Kusto best practices carry over to Berserk, but Berserk's indexing changes which ones matter. A large class of patterns that are "slow, avoid them" in Azure Data Explorer (ADX) — contains, case-insensitive matching, predicates on dynamic fields, free-text and wildcard search — are cheap in Berserk because they reduce to index probes that prune data before any rows are scanned. This page is the ADX best-practices list, re-mapped to Berserk.

How Berserk prunes

Berserk skips data at three levels before reading rows. Knowing them explains every recommendation below:

Shard index — each table is indexed on a few chosen fields (commonly resource['service.name'] and metric_name). Filtering one of them with ==, in (…), or isNotNull(…) lets Berserk drop everything that can't match before reading any data. Lining your filters up with a table's shard fields is the single biggest win.
Range index — datetime and numeric columns are range-indexed, so the intrinsic timestamp filter and numeric == / < / > / between narrow the data cheaply. Keep timestamps and quantities in real datetime / numeric columns — not strings — to get it.
Token bloom — string and dynamic-string values are token-indexed, so string search and equality — has, contains, startswith, ==, =~, and free-text search — skip data without reading rows.

A predicate engages these indexes most reliably when applied to a bare column or path. Wrapping it in a function generally forces per-row evaluation — but the planner can see through a number of common wrappers and projections, so treat this as a strong default rather than an absolute.

In short

Goal	In Berserk, do this	Avoid	How it differs from ADX
Reduce data scanned	Put `where` right after the source so the `timestamp` filter and field predicates prune chunks (bloom / SHAR / range).	—	Same principle. The CLI has no default window — pass `--since` (or bound the query on `timestamp`/`ingest_time`), or the scan errors.
Time filters	Filter on the intrinsic `timestamp` (native `datetime`) — the strongest prune.	Storing time as `long`/string; shadowing `timestamp` with a non-datetime `project`/`extend`.	Same as ADX's "use `datetime`", but `timestamp` is the canonical event-time column and drives chunk elimination via the range index.
Equality on a shard field	`where resource['service.name'] == "query"` or `in (…)` on a configured shard field (e.g. `resource['service.name']`, `metric_name`) — prunes whole segments.	—	Berserk shards segments by configured fields; `==`/`in` on them is eliminated at the segment level, before the chunk bloom even runs.
Numeric / time range	`==`, `<`, `>`, `between` on a numeric or `datetime` column — the per-chunk min/max range index prunes chunks.	Storing numbers/timestamps as strings (no range index).	Same as ADX's column-index intuition; numeric and datetime columns are range-indexed.
String search operator	Pick by meaning — `has` (whole token) vs `contains` (substring) — they cost the same.	—	Major difference. Bloom indexes accelerate `has`, `contains`, `startswith`, `endswith`, `hasprefix`, `hassuffix` equally. The ADX rule "prefer `has` over `contains`" does not apply.
Case-insensitive match	`col =~ "value"` (and `!~`).	—	Differs: `=~` is index-friendly — its bloom is case-folded, so it prunes like `==`, on dynamic fields too. ADX discourages `=~` for performance; in Berserk it's the idiomatic form. (`tolower(col) == "value"` also prunes here, via see-through — but `=~` reads better and is portable.)
Case-sensitivity	Use `==` / `contains_cs` / `has_cs` when you need exact case — marginally faster (they skip the fold).	—	Differs: the case-insensitive forms still prune in Berserk, so case-sensitive is an optimization, not a requirement.
Free-text search	`search "term"` and `*`-scoped predicates are fast — they lower to a whole-row token bloom and prune chunks, across tokens.	—	Inverted from ADX. ADX says "never ``, it forces a full scan." In Berserk `` is a first-class indexed path with no per-column materialize or parse.
Regex matching	Give the pattern a clear literal anchor — `matches regex "GET /api/v[0-9]+"`. Berserk mines literal substrings from the pattern and uses them as a bloom prefilter, then SIMD-scans only surviving chunks.	Anchorless patterns (`".*"`, bare character classes, alternations with no shared literal) — there is nothing to mine, so the match degrades to a full string-field scan.	The mined-literal prefilter (incl. `* matches regex …`) is Berserk-specific; ADX comparable regex search is always a scan.
Column vs whole-row	Prefer a named column (`col has "x"`) when you know the field — field-scoped bloom is more selective.	—	A difference of degree: `*` is not a disaster in Berserk, just less selective than a named field.
Predicates on dynamic fields	Compare bare: `where attributes.code == 502`, `where resource.host.name =~ "…"`. The dynamic path is bloom-indexed and prunes.	Wrapping a scan predicate in `tostring()` / `tolower()` / `tolong()`.	Differs: ADX needs a `where Dyn has "v" \| where Dyn.k == "v"` two-step to avoid parsing every row; Berserk prunes the path-scoped predicate directly.
Dynamic → typed	Pass dynamic fields straight to typed functions — permissive mode injects the matching `asXXX` extractor (extract-or-null) only when needed. Use explicit `to*()` only to cross types, and only in `project`/`extend`.	`to*()` inside a `where`.	Berserk-specific: bare dynamic comparisons stay index-engaged; coercion is automatic for typed arguments, never for scan predicates.
Filter on columns, not computed values	`where predicate(col)`.	`extend v = expr \| where predicate(v)`.	Same as ADX — and load-bearing here, since the bloom/SHAR prune only sees bare column/path predicates.
Type conversions over huge inputs	Filter first, then convert in `project`/`extend`.	Converting before filtering.	Same data-reduction principle.
Exploratory queries	End with `take N` or `count`.	Unbounded scans over unknown data.	Same.
Extract from text	`parse` for same-shaped strings; `matches regex` / `extract()` for irregular ones; `extract_log_template()` to cluster log lines.	Many `extract()` calls for uniform input.	Same, plus Berserk's log-template functions.
Repeated subexpression	Rely on automatic map-reduce state caching (results are reused across overlapping time windows and bin spans).	—	Differs: there is no `materialize()` — see Not available.
Joins	Filter and shrink both inputs first; use `join` (or time-windowed correlation) and `in (…)` instead of a semi-join for single-column filters.	—	`in`-over-semi-join is the same. Join distribution is automatic; the ADX shuffle/broadcast/`lookup` tuning knobs are not exposed.

Why so many ADX rules don't apply

The ADX best-practices page is largely a catalogue of ways to avoid scans: prefer has over contains, avoid =~, don't search *, pre-filter dynamic lookups, materialize extracted columns at ingestion. Each exists because, in ADX, those operations fall back to scanning column data.

In Berserk the chunk-level bloom filter is case-folded and token-based, so the operations ADX warns about all become the same cheap primitive — a token probe that prunes chunks before any row is read:

has, contains, startswith, … — one token bloom each, equal cost.
=~ / case-insensitive — the bloom is already folded, so it prunes like ==.
where dynamic.path == "string" — the string is a bloom key; prunes directly. Also skips chunks that have no rows with that path.
search "term" / * / * matches regex "…" — whole-row token bloom (regex mines its literal anchor), then a SIMD scan of survivors.

The result: pick operators by semantics, not by a performance cheat sheet. The two rules that still carry real weight are the universal ones — filter early (so pruning runs first) and prefer bare-column predicates, keeping in mind the see-through cases below.

Index see-through

"Filter on a bare column" is a strong default, but the planner recognizes several wrappers and traces a predicate back to an indexed source, so these still prune:

Case folding — tolower(col) == "x" / toupper(col) == "X" reuse the case-folded bloom (the same one =~ uses).
Safe unwrapping — a safe tostring(col) == "x" and identity numeric to*() comparisons are unwrapped back to the bare column.
Projection & parse lineage — a literal compared against a column derived from a string source is traced back to that source. For example:
```
__json_logs
| extend evt = parse_json(body)
| where evt.action == "exception"
```
filters on the computed evt, yet the planner emits a * has "exception" chunk bloom on the source body and prunes before parsing; the evt.action == "exception" check then runs only on surviving rows. The same lineage tracking applies through project and the parse operator.

The literal still has to be a clean token to be bloom-safe: a literal containing spaces or punctuation (e.g. "GET /api") isn't a single token, so the bloom leaf is dropped and that predicate falls back to a scan. This is the same reason a regex needs a clear literal anchor.

When in doubt, look at the physical plan (the --physical plan view): a Bloom:, Range:, or shard line for your predicate means the index is engaged; its absence means you're scanning.

Not available in Berserk

A few ADX best-practice rows have no Berserk equivalent:

materialize() — no explicit named-result materialization. Map-reduce state caching covers some of the repeated-computation benefit automatically.
Materialized views / materialized_view() — not implemented.
Update policies — no ingestion-time field extraction. Rely on bloom indexes and asXXX coercion at query time instead.
Join tuning — hint.shufflekey, hint.strategy=broadcast, and the lookup operator are not implemented; join distribution is automatic.

See Compared to Microsoft KQL and Missing functions for the full compatibility surface.

How Berserk prunes

In short

Why so many ADX rules don't apply

Index see-through

Not available in Berserk

On this page