Best Practices
Writing fast Berserk queries — and where Berserk's performance model differs from Azure Data Explorer
Most Microsoft Kusto best practices carry over to Berserk, but Berserk's indexing changes which ones matter. A large class of patterns that are "slow, avoid them" in Azure Data Explorer (ADX) — contains, case-insensitive matching, predicates on dynamic fields, free-text and wildcard search — are cheap in Berserk because they reduce to index probes that prune data before any rows are scanned. This page is the ADX best-practices list, re-mapped to Berserk.
How Berserk prunes
Berserk skips data at three levels before reading rows. Knowing them explains every recommendation below:
- Shard index: segments are sharded by a set of fields configured per table (for example resource['service.name'] and metric_name). An == or in (…) filter on a shard field eliminates whole segments. Sharding is recorded per segment file at ingestion and never re-indexed, so changing the shard fields affects only newly ingested data; the schema can evolve over time without rewriting history.
- Range index: datetime and numeric columns carry per-chunk min/max bounds, so the intrinsic timestamp filter and numeric == / < / > / between prune chunks cheaply.
- Token bloom: every string and dynamic field has a per-chunk, case-folded token bloom, so string search and equality prune chunks without reading them.
A predicate engages these indexes most reliably when applied to a bare column or path. Wrapping it in a function generally forces per-row evaluation — but the planner can see through a number of common wrappers and projections, so treat this as a strong default rather than an absolute.
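As a rough illustration, the query below stacks one predicate per pruning level. It reuses the __json_logs table and body column from the lineage example on this page. It assumes that resource['service.name'] is a configured shard field for that table and that ago() is available. The literal values are made up.

```kusto
// Sketch only. Assumptions: resource['service.name'] is a shard field on this
// table, ago() is supported, and the literal values are hypothetical.
__json_logs
| where timestamp > ago(30m)                   // range index: per-chunk min/max on timestamp
| where resource['service.name'] == "query"    // shard index: drops whole segments
| where body has "timeout"                     // token bloom: drops chunks without the token
| take 100
```

Each filter is written against a bare column or path, so none of them depends on the planner seeing through a wrapper.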
In short
| Goal | In Berserk, do this | Avoid | How it differs from ADX |
|---|---|---|---|
| Reduce data scanned | Put where right after the source so the timestamp filter and field predicates prune chunks (bloom / shard / range). | — | Same principle. The CLI also defaults to a 1-hour window; pass --since explicitly to look further back. |
| Time filters | Filter on the intrinsic timestamp (native datetime) — the strongest prune. | Storing time as long/string; shadowing timestamp with a non-datetime project/extend. | Same as ADX's "use datetime", but timestamp is the canonical event-time column and drives chunk elimination via the range index. |
| Equality on a shard field | where resource['service.name'] == "query" or in (…) on a configured shard field (e.g. resource['service.name'], metric_name) — prunes whole segments. | — | Berserk shards segments by configured fields; ==/in on them is eliminated at the segment level, before the chunk bloom even runs. |
| Numeric / time range | ==, <, >, between on a numeric or datetime column — the per-chunk min/max range index prunes chunks. | Storing numbers/timestamps as strings (no range index). | Same as ADX's column-index intuition; numeric and datetime columns are range-indexed. |
| String search operator | Pick by meaning — has (whole token) vs contains (substring) — they cost the same. | — | Major difference. Bloom indexes accelerate has, contains, startswith, endswith, hasprefix, hassuffix equally. The ADX rule "prefer has over contains" does not apply. |
| Case-insensitive match | col =~ "value" (and !~). | — | Differs: =~ is index-friendly — its bloom is case-folded, so it prunes like ==, on dynamic fields too. ADX discourages =~ for performance; in Berserk it's the idiomatic form. (tolower(col) == "value" also prunes here, via see-through — but =~ reads better and is portable.) |
| Case-sensitivity | Use == / contains_cs / has_cs when you need exact case — marginally faster (they skip the fold). | — | Differs: the case-insensitive forms still prune in Berserk, so case-sensitive is an optimization, not a requirement. (in~ is not implemented; use in.) |
| Free-text search | search "term" and *-scoped predicates are fast — they lower to a whole-row token bloom and prune chunks, across tokens. | — | Inverted from ADX. ADX says "never *, it forces a full scan." In Berserk * is a first-class indexed path with no per-column materialize or parse. |
| Regex matching | Give the pattern a clear literal anchor, for example matches regex "GET /api/v[0-9]+". Berserk mines literal substrings from the pattern and uses them as a bloom prefilter, then SIMD-scans only surviving chunks. | Anchorless patterns (".*", bare character classes, alternations with no shared literal): there is nothing to mine, so the match degrades to a full string-field scan. | The mined-literal prefilter (incl. * matches regex …) is Berserk-specific; in ADX, comparable regex search is always a scan. |
| Column vs whole-row | Prefer a named column (col has "x") when you know the field — field-scoped bloom is more selective. | — | A difference of degree: * is not a disaster in Berserk, just less selective than a named field. |
| Predicates on dynamic fields | Compare bare: where attributes.code == 502, where resource.host.name =~ "…". The dynamic path is bloom-indexed and prunes. | Wrapping a scan predicate in tostring() / tolower() / tolong(). | Differs: ADX needs a where Dyn has "v" | where Dyn.k == "v" two-step to avoid parsing every row; Berserk prunes the path-scoped predicate directly. |
| Dynamic → typed | Pass dynamic fields straight to typed functions — permissive mode injects the matching asXXX extractor (extract-or-null) only when needed. Use explicit to*() only to cross types, and only in project/extend. | to*() inside a where. | Berserk-specific: bare dynamic comparisons stay index-engaged; coercion is automatic for typed arguments, never for scan predicates. |
| Filter on columns, not computed values | where predicate(col). | extend v = expr | where predicate(v). | Same as ADX, and load-bearing here, since the bloom/shard prune only sees bare column/path predicates. |
| Type conversions over huge inputs | Filter first, then convert in project/extend. | Converting before filtering. | Same data-reduction principle. |
| Exploratory queries | End with take N or count. | Unbounded scans over unknown data. | Same. |
| Extract from text | parse for same-shaped strings; matches regex / extract() for irregular ones; extract_log_template() to cluster log lines. | Many extract() calls for uniform input. | Same, plus Berserk's log-template functions. |
| Repeated subexpression | Rely on automatic map-reduce state caching (results are reused across overlapping time windows and bin spans). | — | Differs: there is no materialize() — see Not available. |
| Joins | Filter and shrink both inputs first; use join (or time-windowed correlation) and in (…) instead of a semi-join for single-column filters. | — | in-over-semi-join is the same. Join distribution is automatic; the ADX shuffle/broadcast/lookup tuning knobs are not exposed. |
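The dynamic-field and type-conversion rows can be read as the following pair of queries. This is a sketch: attributes.code is the illustrative path used in the table, and its presence on __json_logs is an assumption.

```kusto
// Sketch only: attributes.code on __json_logs is an assumption.

// Prefer: bare dynamic path in the filter; cross types afterwards in extend.
__json_logs
| where attributes.code == 502
| extend code_text = tostring(attributes.code)
| take 50

// Avoid: converting first, then filtering on the computed value.
__json_logs
| extend code_text = tostring(attributes.code)
| where code_text == "502"
| take 50
```

The second form may still prune through the see-through cases described on this page, but the first form does not rely on them.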
Why so many ADX rules don't apply
The ADX best-practices page is largely a catalogue of ways to avoid scans: prefer has over contains, avoid =~, don't search *, pre-filter dynamic lookups, materialize extracted columns at ingestion. Each exists because, in ADX, those operations fall back to scanning column data.
In Berserk the chunk-level bloom filter is case-folded and token-based, so the operations ADX warns about all become the same cheap primitive — a token probe that prunes chunks before any row is read:
- has, contains, startswith, …: one token bloom each, equal cost.
- =~ / case-insensitive: the bloom is already folded, so it prunes like ==.
- where dynamic.path == v: the path is a bloom key; prunes directly.
- search "term" / * / * matches regex "…": whole-row token bloom (regex mines its literal anchor), then a SIMD scan of survivors.
The result: pick operators by semantics, not by a performance cheat sheet. The two rules that still carry real weight are the universal ones — filter early (so pruning runs first) and prefer bare-column predicates, keeping in mind the see-through cases below.
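To make "pick operators by semantics" concrete, here is a small sketch of three filters that differ in meaning rather than in cost. The table and body column come from the lineage example on this page. The search term and the host name "web-01" are hypothetical.

```kusto
// Sketch only: literal values are hypothetical.

// Whole-token match: matches a body containing the token "timeout",
// but not one where the only occurrence is inside "timeouts".
__json_logs | where body has "timeout" | count

// Substring match: matches "timeout" and "timeouts" alike.
__json_logs | where body contains "timeout" | count

// Case-insensitive equality on a dynamic path.
__json_logs | where resource.host.name =~ "web-01" | count
```

The choice between the first two depends only on whether partial-word matches are wanted.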
Index see-through
"Filter on a bare column" is a strong default, but the planner recognizes several wrappers and traces a predicate back to an indexed source, so these still prune:
- Case folding: tolower(col) == "x" / toupper(col) == "X" reuse the case-folded bloom (the same one =~ uses).
- Safe unwrapping: a safe tostring(col) == "x" and identity numeric to*() comparisons are unwrapped back to the bare column.
- Projection & parse lineage: a literal compared against a column derived from a string source is traced back to that source. For example:

      __json_logs | extend evt = parse_json(body) | where evt.action == "exception"

  filters on the computed evt, yet the planner emits a * has "exception" chunk bloom on the source body and prunes before parsing; the evt.action == "exception" check then runs only on surviving rows. The same lineage tracking applies through project and the parse operator.
The literal still has to be a clean token to be bloom-safe: a literal containing spaces or punctuation (e.g. "GET /api") isn't a single token, so the bloom leaf is dropped and that predicate falls back to a scan. This is the same reason a regex needs a clear literal anchor.
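The clean-token rule can be sketched with two variants of the lineage example. The evt.request field is hypothetical. The third query is only a possible workaround, and it assumes that "api" is indexed as its own token.

```kusto
// Sketch only: evt.request is a hypothetical field.

// Clean single token: the lineage-derived bloom on body can prune.
__json_logs
| extend evt = parse_json(body)
| where evt.action == "exception"

// Literal with a space and punctuation: not a single token, so per the
// paragraph above the bloom leaf is dropped and this predicate scans.
__json_logs
| extend evt = parse_json(body)
| where evt.request == "GET /api"

// Possible workaround (assumption: "api" is tokenized separately):
// add a clean-token filter on the source column before the exact check.
__json_logs
| where body has "api"
| extend evt = parse_json(body)
| where evt.request == "GET /api"
```

Whether the added filter actually engages the bloom is best confirmed in the physical plan.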
When in doubt, look at the physical plan (the --physical plan view): a Bloom:, Range:, or shard line for your predicate means the index is engaged; its absence means you're scanning.
Not available in Berserk
A few ADX best-practice rows have no Berserk equivalent:
- materialize(): no explicit named-result materialization. Map-reduce state caching covers some of the repeated-computation benefit automatically.
- Materialized views / materialized_view(): not implemented.
- Update policies: no ingestion-time field extraction. Rely on bloom indexes and asXXX coercion at query time instead.
- Join tuning: hint.shufflekey, hint.strategy=broadcast, and the lookup operator are not implemented; join distribution is automatic.
- in~: use in (case-sensitive).
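For the in~ gap, one option is sketched below. It assumes that or-combined =~ terms are accepted. The service names are hypothetical, and this page does not say whether an or of =~ predicates prunes as well as a single one.

```kusto
// Sketch only: service names are hypothetical.

// Case-sensitive membership, as recommended above.
__json_logs
| where resource['service.name'] in ("query", "ingest")
| count

// Case-insensitive alternative (assumption: or-ed =~ terms are supported).
__json_logs
| where resource['service.name'] =~ "query" or resource['service.name'] =~ "ingest"
| count
```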
See Compared to Microsoft KQL and Missing functions for the full compatibility surface.