Berserk Docs

Regex Syntax

Regular expression syntax supported in KQL

Regular expressions in KQL are used by operators and functions such as matches regex, parse, and replace_regex().

Regular expressions must be encoded as string literals and follow the string quoting rules. For example, the regular expression \A is written in KQL as "\\A": the first backslash escapes the second, so the string literal holds the two characters \A that the regex engine receives.
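
The same double-backslash rule applies in most languages with backslash-escaped string literals. The sketch below uses Python purely as a stand-in (it is not KQL, and its re module is a different engine) to show that the literal "\\A" carries the two-character regular expression \A:

```python
import re

# In a backslash-escaped string literal, "\\A" is two characters: a backslash and "A".
pattern = "\\A"
pattern_length = len(pattern)

# The engine therefore sees \A (start of haystack) followed by the literal "log".
anchored = re.search("\\Alog", "log line") is not None
not_at_start = re.search("\\Alog", "syslog") is not None

print(pattern_length, anchored, not_at_start)  # 2 True False
```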

Match one character

Pattern    Description
.          Any character except newline (includes newline with s flag)
[0-9]      Any ASCII digit
[^0-9]     Any character that isn't an ASCII digit
\d         Digit (\p{Nd})
\D         Not a digit
\pX        Unicode character class identified by a one-letter name
\p{Greek}  Unicode character class (general category or script)
\PX        Negated Unicode character class identified by a one-letter name
\P{Greek}  Negated Unicode character class (general category or script)
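
The difference between [0-9] and \d matters for non-ASCII input. As a rough illustration, Python's re module also treats \d as "any Unicode decimal digit" for text patterns, so it can stand in for the behaviour described above; it is a different engine from the one KQL uses, and it does not support \p{...} classes:

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# [0-9] only covers the ten ASCII digits.
ascii_class_matches = re.fullmatch(r"[0-9]", arabic_indic_three) is not None

# \d covers every decimal digit in Unicode category Nd.
unicode_digit_matches = re.fullmatch(r"\d", arabic_indic_three) is not None

# Both forms agree on plain ASCII digits.
both_match_ascii = all(re.fullmatch(p, "7") for p in (r"[0-9]", r"\d"))

print(ascii_class_matches, unicode_digit_matches, both_match_ascii)  # False True True
```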

Character classes

Pattern       Description
[xyz]         Matching either x, y or z (union)
[^xyz]        Matching any character except x, y, and z
[a-z]         Matching any character in range a-z
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching a and h only)
[\[\]]        Escape in character classes (matching [ or ])

Any named character class may appear inside a bracketed [...] character class. For example, [\p{Greek}[:digit:]] matches any ASCII digit or any codepoint in the Greek script.

Precedence (most binding to least binding):

  1. Ranges: [a-cd] == [[a-c]d]
  2. Union: [ab&&bc] == [[ab]&&[bc]]
  3. Intersection, difference, symmetric difference: equal precedence, evaluated left-to-right
  4. Negation: [^a-z&&b] == [^[a-z&&b]]
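
The set operators and their precedence can be modelled with ordinary sets. The following sketch uses Python sets rather than a regex engine (Python's re module has no &&, -- or ~~ operators); the helper name chars and the restriction of negation to ASCII are assumptions made only to keep the example small:

```python
def chars(first, last):
    """Return the set of characters in the inclusive range first-last."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}

intersection = chars("a", "y") & set("xyz")         # models [a-y&&xyz]
subtraction = chars("0", "9") - {"4"}               # models [0-9--4]
symmetric_diff = chars("a", "g") ^ chars("b", "h")  # models [a-g~~b-h]

# Union binds tighter than intersection: [ab&&bc] is [[ab]&&[bc]].
union_then_intersect = set("ab") & set("bc")

# Negation binds last: [^a-z&&b] negates the whole intersection.
# The universe is limited to ASCII here purely for illustration.
ascii_chars = {chr(c) for c in range(128)}
negated = ascii_chars - (chars("a", "z") & {"b"})

print(sorted(intersection))          # ['x', 'y']
print("".join(sorted(subtraction)))  # 012356789
print(sorted(symmetric_diff))        # ['a', 'h']
print(sorted(union_then_intersect))  # ['b']
print("a" in negated, "b" in negated)  # True False
```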

Composites

Pattern  Description
xy       Concatenation (x followed by y)
x|y      Alternation (x or y, prefer x)
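
"Prefer x" means the left branch wins whenever both branches could match at the same position, even if the right branch would match more text. Python's re module resolves alternation the same way, so it serves as a stand-in here; treating the two engines as equivalent on this point is an assumption based on the description above:

```python
import re

# The left alternative is tried first, so the shorter "foo" wins.
prefers_left = re.search(r"foo|foobar", "foobar").group()

# Listing the longer alternative first lets it win instead.
longer_first = re.search(r"foobar|foo", "foobar").group()

print(prefers_left, longer_first)  # foo foobar
```

When one alternative is a prefix of another, putting the longer one first is the usual way to get the longer match.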

Repetitions

Pattern  Description
x*       Zero or more of x (greedy)
x+       One or more of x (greedy)
x?       Zero or one of x (greedy)
x*?      Zero or more of x (ungreedy/lazy)
x+?      One or more of x (ungreedy/lazy)
x??      Zero or one of x (ungreedy/lazy)
x{n,m}   At least n x and at most m x (greedy)
x{n,}    At least n x (greedy)
x{n}     Exactly n x
x{n,m}?  At least n x and at most m x (ungreedy/lazy)
x{n,}?   At least n x (ungreedy/lazy)
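
Greedy quantifiers take as much input as they can, while lazy ones stop at the first opportunity. The quantifier syntax in the table is shared with Python's re module, which is used below only as a convenient stand-in for the KQL engine:

```python
import re

text = "<a><b>"

greedy = re.search(r"<.+>", text).group()  # .+ runs to the last ">"
lazy = re.search(r"<.+?>", text).group()   # .+? stops at the first ">"

bounded_greedy = re.search(r"a{2,3}", "aaaa").group()  # takes the maximum, 3
bounded_lazy = re.search(r"a{2,3}?", "aaaa").group()   # takes the minimum, 2

print(greedy, lazy, bounded_greedy, bounded_lazy)  # <a><b> <a> aaa aa
```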

Anchors

Pattern  Description
^        Beginning of haystack, or start-of-line with multi-line mode
$        End of haystack, or end-of-line with multi-line mode
\A       Only the beginning of a haystack (even with multi-line mode)
\z       Only the end of a haystack (even with multi-line mode)
\b       Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B       Not a Unicode word boundary
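
The difference between ^ and \A only shows up in multi-line mode. The sketch below uses Python's re module as a stand-in, since it treats ^, \A and \b the same way as described above; \z is left out because Python has traditionally spelled that anchor \Z:

```python
import re

haystack = "first\nsecond"

# In multi-line mode, ^ matches at the start of every line.
line_starts = re.findall(r"(?m)^\w+", haystack)

# \A still matches only at the very start of the haystack.
haystack_start = re.findall(r"(?m)\A\w+", haystack)

# Without multi-line mode, ^ behaves like \A.
without_multiline = re.findall(r"^\w+", haystack)

# \b requires a word character on exactly one side of the position.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(line_starts)        # ['first', 'second']
print(haystack_start)     # ['first']
print(without_multiline)  # ['first']
print(whole_words)        # ['cat', 'cat']
```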

Grouping and flags

Pattern        Description
(exp)          Numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  Named capture group
(?<name>exp)   Named capture group
(?:exp)        Non-capturing group
(?flags)       Set flags within current group
(?flags:exp)   Set flags for exp (non-capturing)
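
Named groups are also numbered by their opening parenthesis, and non-capturing groups are skipped in that numbering. A small illustration, again using Python's re module as a stand-in; Python accepts the (?P<name>exp) spelling but not (?<name>exp), so only the former appears here:

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2024-05")
year = m.group("year")  # access by name
month = m.group(2)      # the same groups are also numbered 1, 2, ...

# (?:ab) groups without capturing, so (c) is capture group 1.
groups = re.search(r"(?:ab)+(c)", "ababc").groups()

print(year, month, groups)  # 2024 05 ('c',)
```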

Flags

Flag  Description
i     Case-insensitive: letters match both upper and lower case
m     Multi-line mode: ^ and $ match begin/end of line
s     Allow . to match \n
R     CRLF mode: when multi-line mode is enabled, \r\n is used
U     Swap the meaning of x* and x*?
u     Unicode support (enabled by default)
x     Verbose mode, ignores whitespace and allows line comments starting with #

Flags can be toggled within a pattern. For example, (?i)a+(?-i)b+ uses a case-insensitive match for a+ and a case-sensitive match for b+.
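
The same effect can be written with the scoped form (?flags:exp). The sketch below uses that form because it is also valid in Python's re module, the stand-in used for this illustration; Python rejects the mid-pattern (?i)...(?-i) toggle, so that exact spelling is not exercised here:

```python
import re

# Case-insensitive a+ followed by case-sensitive b+.
pattern = r"(?i:a+)b+"
mixed_case_a = re.fullmatch(pattern, "AaAbb") is not None
upper_case_b = re.fullmatch(pattern, "aaBB") is not None

# The s flag lets . match a newline.
dot_with_s = re.search(r"(?s)a.b", "a\nb") is not None
dot_without_s = re.search(r"a.b", "a\nb") is not None

print(mixed_case_a, upper_case_b, dot_with_s, dot_without_s)  # True False True False
```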

Escape sequences

Pattern     Description
\*          Literal * (applies to all ASCII except [0-9A-Za-z<>])
\a          Bell (\x07)
\f          Form feed (\x0C)
\t          Horizontal tab
\n          New line
\r          Carriage return
\v          Vertical tab (\x0B)
\123        Octal character code, up to three digits
\x7F        Hex character code (exactly two digits)
\x{10FFFF}  Hex character code (Unicode code point)
\u007F      Hex character code (exactly four digits)
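
Several of these escapes are different spellings of the same character. As a quick check, the subset that Python's re module also understands can be tried there (Python is only a stand-in, and it does not accept the \x{...} form):

```python
import re

# Three spellings of "A" (code point 0x41, octal 101).
hex_two = re.fullmatch(r"\x41", "A") is not None
hex_four = re.fullmatch(r"\u0041", "A") is not None
octal = re.fullmatch(r"\101", "A") is not None

# A backslash turns the * metacharacter into a literal.
literal_star = re.fullmatch(r"a\*b", "a*b") is not None

print(hex_two, hex_four, octal, literal_star)  # True True True True
```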

Perl character classes (Unicode)

Based on UTS#18:

Pattern  Description
\d       Digit (\p{Nd})
\D       Not digit
\s       Whitespace (\p{White_Space})
\S       Not whitespace
\w       Word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W       Not word character

ASCII character classes

Pattern       Description
[[:alnum:]]   Alphanumeric ([0-9A-Za-z])
[[:alpha:]]   Alphabetic ([A-Za-z])
[[:ascii:]]   ASCII ([\x00-\x7F])
[[:blank:]]   Blank ([\t ])
[[:cntrl:]]   Control ([\x00-\x1F\x7F])
[[:digit:]]   Digits ([0-9])
[[:graph:]]   Graphical ([!-~])
[[:lower:]]   Lower case ([a-z])
[[:print:]]   Printable ([ -~])
[[:punct:]]   Punctuation ([!-/:-@\[-`{-~])
[[:space:]]   Whitespace ([\t\n\v\f\r ])
[[:upper:]]   Upper case ([A-Z])
[[:word:]]    Word characters ([0-9A-Za-z_])
[[:xdigit:]]  Hex digit ([0-9A-Fa-f])

Performance tips

  • Unicode affects memory and speed: Unicode character classes like \w match ~140,000 codepoints. If ASCII suffices, use [0-9A-Za-z_] or (?-u:\w) instead.
  • Word boundaries: If you don't need Unicode-aware word boundaries, (?-u:\b) is faster than \b.
  • Literals accelerate searches: Including literal characters in your pattern helps the regex engine optimize. For example, in \w+@\w+, the engine can first find each @ and then match \w+ in reverse from there to locate the start of the match.
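
Replacing \w with an ASCII class is not only a speed trade-off; it also changes which inputs match. The sketch below shows the semantic difference using Python's re module as a stand-in, where \w is likewise Unicode-aware for text patterns; the (?-u:...) syntax is not valid in Python and is therefore not used:

```python
import re

word = "caf\u00e9"  # "café", ending in a non-ASCII letter

# Unicode-aware \w accepts the accented letter.
unicode_w = re.fullmatch(r"\w+", word) is not None

# The ASCII-only class does not, so the full-string match fails...
ascii_class = re.fullmatch(r"[0-9A-Za-z_]+", word) is not None

# ...and an unanchored match stops before the accented letter.
ascii_prefix = re.match(r"[0-9A-Za-z_]+", word).group()

print(unicode_w, ascii_class, ascii_prefix)  # True False caf
```

Before switching to the ASCII forms, it is worth confirming that the data never contains non-ASCII word characters.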
