Berserk Docs

Best Practices

Writing fast Berserk queries — and where Berserk's performance model differs from Azure Data Explorer

Most Microsoft Kusto best practices carry over to Berserk, but Berserk's indexing changes which ones matter. A large class of patterns that are "slow, avoid them" in Azure Data Explorer (ADX) — contains, case-insensitive matching, predicates on dynamic fields, free-text and wildcard search — are cheap in Berserk because they reduce to index probes that prune data before any rows are scanned. This page is the ADX best-practices list, re-mapped to Berserk.

How Berserk prunes

Berserk skips data at three levels before reading rows. Knowing them explains every recommendation below:

  • Shard index: segments are sharded by a set of fields configured per table (for example resource['service.name'] and metric_name). An == or in (…) filter on a shard field eliminates whole segments. Sharding is recorded per segment file at ingestion and never re-indexed, so changing the shard fields affects only newly ingested data; the schema can evolve over time without rewriting history.
  • Range index: datetime and numeric columns carry per-chunk min/max bounds, so the intrinsic timestamp filter and numeric == / < / > / between prune chunks cheaply.
  • Token bloom: every string and dynamic field has a per-chunk, case-folded token bloom, so string search and equality prune chunks without reading them.

A predicate engages these indexes most reliably when applied to a bare column or path. Wrapping it in a function generally forces per-row evaluation — but the planner can see through a number of common wrappers and projections, so treat this as a strong default rather than an absolute.
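As a rough illustration of the three levels working together, the sketch below keeps every predicate on a bare column or path. The table name metrics, the value column, the attributes.route path, and the literal values are hypothetical; ago() is assumed to behave as in standard KQL.

```kusto
// Hypothetical table, assumed to be sharded on resource['service.name'] and metric_name.
metrics
| where timestamp > ago(2h)                      // intrinsic timestamp: range index prunes chunks
| where metric_name == "http.server.duration"    // shard field equality: prunes whole segments
| where value > 500                              // numeric column: range index prunes chunks
| where attributes.route has "checkout"          // dynamic path: token bloom prunes chunks
| take 100
```

Each line is a plain comparison against a column or path, which is the shape the paragraph above describes as engaging the indexes most reliably.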

In short

| Goal | In Berserk, do this | Avoid | How it differs from ADX |
| --- | --- | --- | --- |
| Reduce data scanned | Put where right after the source so the timestamp filter and field predicates prune chunks (bloom / SHAR / range). | | Same principle. The CLI also defaults to a 1-hour window; pass --since explicitly to look further back. |
| Time filters | Filter on the intrinsic timestamp (native datetime), the strongest prune. | Storing time as long/string; shadowing timestamp with a non-datetime project/extend. | Same as ADX's "use datetime", but timestamp is the canonical event-time column and drives chunk elimination via the range index. |
| Equality on a shard field | where resource['service.name'] == "query" or in (…) on a configured shard field (e.g. resource['service.name'], metric_name); prunes whole segments. | | Berserk shards segments by configured fields; ==/in on them eliminates data at the segment level, before the chunk bloom even runs. |
| Numeric / time range | ==, <, >, between on a numeric or datetime column; the per-chunk min/max range index prunes chunks. | Storing numbers/timestamps as strings (no range index). | Same as ADX's column-index intuition; numeric and datetime columns are range-indexed. |
| String search operator | Pick by meaning: has (whole token) vs contains (substring). They cost the same. | | Major difference. Bloom indexes accelerate has, contains, startswith, endswith, hasprefix, hassuffix equally. The ADX rule "prefer has over contains" does not apply. |
| Case-insensitive match | col =~ "value" (and !~). | | Differs: =~ is index-friendly. Its bloom is case-folded, so it prunes like ==, on dynamic fields too. ADX discourages =~ for performance; in Berserk it's the idiomatic form. (tolower(col) == "value" also prunes here, via see-through, but =~ reads better and is portable.) |
| Case-sensitivity | Use == / contains_cs / has_cs when you need exact case; they are marginally faster (they skip the fold). | | Differs: the case-insensitive forms still prune in Berserk, so case-sensitive is an optimization, not a requirement. (in~ is not implemented; use in.) |
| Free-text search | search "term" and *-scoped predicates are fast: they lower to a whole-row token bloom and prune chunks, across tokens. | | Inverted from ADX. ADX says "never *, it forces a full scan." In Berserk * is a first-class indexed path with no per-column materialize or parse. |
| Regex matching | Give the pattern a clear literal anchor, e.g. matches regex "GET /api/v[0-9]+". Berserk mines literal substrings from the pattern and uses them as a bloom prefilter, then SIMD-scans only surviving chunks. | Anchorless patterns (".*", bare character classes, alternations with no shared literal): there is nothing to mine, so the match degrades to a full string-field scan. | The mined-literal prefilter (incl. * matches regex …) is Berserk-specific; the comparable ADX regex search is always a scan. |
| Column vs whole-row | Prefer a named column (col has "x") when you know the field; a field-scoped bloom is more selective. | | A difference of degree: * is not a disaster in Berserk, just less selective than a named field. |
| Predicates on dynamic fields | Compare bare: where attributes.code == 502, where resource.host.name =~ "…". The dynamic path is bloom-indexed and prunes. | Wrapping a scan predicate in tostring() / tolower() / tolong(). | Differs: ADX needs a where Dyn has "v" \| where Dyn.k == "v" two-step to avoid parsing every row; Berserk prunes the path-scoped predicate directly. |
| Dynamic → typed | Pass dynamic fields straight to typed functions; permissive mode injects the matching asXXX extractor (extract-or-null) only when needed. Use explicit to*() only to cross types, and only in project/extend. | to*() inside a where. | Berserk-specific: bare dynamic comparisons stay index-engaged; coercion is automatic for typed arguments, never for scan predicates. |
| Filter on columns, not computed values | where predicate(col). | extend v = expr \| where predicate(v). | Same as ADX, and load-bearing here, since the bloom/SHAR prune only sees bare column/path predicates. |
| Type conversions over huge inputs | Filter first, then convert in project/extend. | Converting before filtering. | Same data-reduction principle. |
| Exploratory queries | End with take N or count. | Unbounded scans over unknown data. | Same. |
| Extract from text | parse for same-shaped strings; matches regex / extract() for irregular ones; extract_log_template() to cluster log lines. | Many extract() calls for uniform input. | Same, plus Berserk's log-template functions. |
| Repeated subexpression | Rely on automatic map-reduce state caching (results are reused across overlapping time windows and bin spans). | | Differs: there is no materialize(); see Not available in Berserk. |
| Joins | Filter and shrink both inputs first; use join (or time-windowed correlation) and in (…) instead of a semi-join for single-column filters. | | in-over-semi-join is the same. Join distribution is automatic; the ADX shuffle/broadcast/lookup tuning knobs are not exposed. |
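Several of these recommendations can be combined in one query shape. The following is a minimal sketch, assuming the __json_logs table used elsewhere on this page has an attributes.code path and a body column; the six-hour window and the output column name code are illustrative only.

```kusto
__json_logs
| where timestamp > ago(6h)                                 // filter right after the source, on the intrinsic timestamp
| where attributes.code == 502                              // bare dynamic path comparison, no to*() wrapper
| project timestamp, code = tolong(attributes.code), body   // explicit conversion only after filtering
| take 50                                                   // bound an exploratory query
```

The conversion sits in project rather than in where, so the scan predicates stay on bare columns and paths and the cast runs only on rows that survive the filters.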

Why so many ADX rules don't apply

The ADX best-practices page is largely a catalogue of ways to avoid scans: prefer has over contains, avoid =~, don't search *, pre-filter dynamic lookups, materialize extracted columns at ingestion. Each exists because, in ADX, those operations fall back to scanning column data.

In Berserk the chunk-level bloom filter is case-folded and token-based, so the operations ADX warns about all become the same cheap primitive — a token probe that prunes chunks before any row is read:

  • has, contains, startswith, … — one token bloom each, equal cost.
  • =~ / case-insensitive — the bloom is already folded, so it prunes like ==.
  • where dynamic.path == v — the path is a bloom key; prunes directly.
  • search "term" / * / * matches regex "…" — whole-row token bloom (regex mines its literal anchor), then a SIMD scan of survivors.

The result: pick operators by semantics, not by a performance cheat sheet. The two rules that still carry real weight are the universal ones — filter early (so pruning runs first) and prefer bare-column predicates, keeping in mind the see-through cases below.
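To make "pick operators by semantics" concrete, here are a few separate queries that differ only in what they mean. The table and the body and resource.host.name fields come from examples on this page; the search terms and the host name are hypothetical.

```kusto
// Whole token: matches "connection timeout" but not "timeouts".
__json_logs
| where body has "timeout"

// Substring: also matches "timeouts".
__json_logs
| where body contains "timeout"

// Case-insensitive equality on a dynamic path.
__json_logs
| where resource.host.name =~ "web01"

// Free-text search across the whole row.
__json_logs
| search "timeout"

// Regex with a literal anchor that can be mined for the bloom prefilter.
__json_logs
| where body matches regex "GET /api/v[0-9]+"
```

Per the list above, all of these are expected to reduce to token probes, so the choice between them should follow the match you actually want.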

Index see-through

"Filter on a bare column" is a strong default, but the planner recognizes several wrappers and traces a predicate back to an indexed source, so these still prune:

  • Case folding: tolower(col) == "x" / toupper(col) == "X" reuse the case-folded bloom (the same one =~ uses).

  • Safe unwrapping: a safe tostring(col) == "x" and identity numeric to*() comparisons are unwrapped back to the bare column.

  • Projection & parse lineage: a literal compared against a column derived from a string source is traced back to that source. For example:

    __json_logs
    | extend evt = parse_json(body)
    | where evt.action == "exception"

    filters on the computed evt, yet the planner emits a * has "exception" chunk bloom on the source body and prunes before parsing; the evt.action == "exception" check then runs only on surviving rows. The same lineage tracking applies through project and the parse operator.
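The project and parse cases might look like the two separate queries below. This is a sketch only: the msg and level names and the "level=" log layout are invented for illustration, and whether a given predicate is actually traced back is best confirmed in the physical plan.

```kusto
// Lineage through project: msg is a renamed body, so a token predicate
// on msg is expected to be traced back to the source column.
__json_logs
| project timestamp, msg = body
| where msg has "exception"

// Lineage through parse: level is cut out of body by a fixed pattern.
// The "level=" layout is a made-up log format.
__json_logs
| parse body with * "level=" level " " *
| where level == "error"
```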

The literal still has to be a clean token to be bloom-safe: a literal containing spaces or punctuation (e.g. "GET /api") isn't a single token, so the bloom leaf is dropped and that predicate falls back to a scan. This is the same reason a regex needs a clear literal anchor.
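The difference between a clean token and a multi-token literal can be sketched as follows. The last query is one possible workaround, not a documented Berserk technique: it assumes that adding a clean-token predicate lets the bloom prune before the phrase check runs, which should be verified in the physical plan.

```kusto
// Clean single token: expected to keep its bloom leaf.
__json_logs
| extend evt = parse_json(body)
| where evt.action == "exception"

// Literal with a space and "/": not a single token, so per the paragraph
// above the bloom leaf is dropped for this predicate.
__json_logs
| extend evt = parse_json(body)
| where evt.request == "GET /api"

// Possible workaround (assumption): filter the source on a clean token first.
// Any body holding "GET /api" also contains "api", so no rows are lost.
__json_logs
| where body contains "api"
| extend evt = parse_json(body)
| where evt.request == "GET /api"
```

The evt.request field is hypothetical; only evt.action appears in the original example.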

When in doubt, look at the physical plan (the --physical plan view): a Bloom:, Range:, or shard line for your predicate means the index is engaged; its absence means you're scanning.

Not available in Berserk

A few ADX best-practice rows have no Berserk equivalent:

  • materialize(): no explicit named-result materialization. Map-reduce state caching covers some of the repeated-computation benefit automatically.
  • Materialized views / materialized_view(): not implemented.
  • Update policies: no ingestion-time field extraction. Rely on bloom indexes and asXXX coercion at query time instead.
  • Join tuning: hint.shufflekey, hint.strategy=broadcast, and the lookup operator are not implemented; join distribution is automatic.
  • in~: use in (case-sensitive).
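Where an ADX query relied on in~ for a case-insensitive set match, one way to approximate it is a chain of =~ comparisons, which this page describes as index-friendly. This is a sketch with hypothetical service names; whether an or of =~ predicates still prunes is an assumption to check in the physical plan.

```kusto
// ADX form (not implemented in Berserk):
//   | where resource['service.name'] in~ ("Query", "Ingest")

// Sketch of a case-insensitive equivalent using =~:
__json_logs
| where resource['service.name'] =~ "query"
     or resource['service.name'] =~ "ingest"
```

If the stored values have a known, consistent case, plain in ("query", "ingest") is the simpler choice.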

See Compared to Microsoft KQL and Missing functions for the full compatibility surface.
