Service Metrics
OpenTelemetry metrics emitted by Berserk services
All Berserk services emit metrics via OpenTelemetry. Metrics are exported over OTLP to the configured collector endpoint and can be queried in Berserk itself.
Each metric name is prefixed with bzrk. followed by the service scope (e.g. bzrk.ui.query_duration).
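
The naming convention can be illustrated with a small parser. This is a sketch only: `split_metric_name` is a hypothetical helper, not part of Berserk, and it assumes the scope is the single dot-separated component that follows the `bzrk.` prefix.

```python
def split_metric_name(full_name: str) -> tuple[str, str]:
    """Split 'bzrk.<scope>.<metric>' into (scope, metric).

    Hypothetical helper for illustration. Assumes the scope is exactly one
    dot-separated component after the 'bzrk.' prefix.
    """
    prefix, sep, rest = full_name.partition(".")
    if prefix != "bzrk" or not sep:
        raise ValueError(f"not a Berserk metric name: {full_name!r}")
    scope, sep, metric = rest.partition(".")
    if not sep or not scope or not metric:
        raise ValueError(f"missing scope or metric in: {full_name!r}")
    return scope, metric


print(split_metric_name("bzrk.ui.query_duration"))  # ('ui', 'query_duration')
```

Grouping metrics by the returned scope gives one bucket per service section listed below.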
A pre-built Grafana dashboard is available for download: bzrk-service-metrics.json. Import it into Grafana and select your Berserk datasource to visualize all metrics below.
Ingest
Simplified OTLP ingest service that receives traces, metrics, and logs over HTTP/gRPC and uploads to S3 via ingest_client.
| Metric | Type | Unit | Description |
|---|---|---|---|
| bzrk.ingest.queue_rejections | counter | — | Total requests rejected due to admission control (semaphore exhaustion or dead stream actor) |
| bzrk.ingest.batch_flush_duration | histogram | ms | Duration of batch flush operations (S3 upload latency) |
| bzrk.ingest.batch_inputs | histogram | items | Incoming OTLP requests coalesced into one S3 batch flush. The p50/p99 values indicate the fan-in to expect at the batch upload span links. |
| bzrk.ingest.data_dropped | counter | — | Total requests dropped due to missing ingest token |
| bzrk.ingest.time_since_last_upload_seconds | gauge | s | Seconds since the last successful S3 upload for the longest-silent stream |
| bzrk.ingest.inflight_requests | gauge | — | Admission permits in use across HTTP/gRPC/Loki transports |
| bzrk.ingest.buffer_bytes | gauge | bytes | Bytes buffered across stream actors (pod-level SUM) |
| bzrk.ingest.phantom_write_retries | counter | — | Total retry attempts after an uncertain upload (held by PendingRetry) |
| bzrk.ingest.phantom_write_retry_budget_exhausted | counter | — | Total times a PendingRetry exhausted its budget without resolving (fell back to retryable error) |
| bzrk.ingest.invalid_otlp_total | counter | — | OTLP payloads rejected by the fast-path validator. Attributes: signal (traces |
Janitor
Background service responsible for segment lifecycle management: merging small segments into larger ones, deleting tombstoned segments from cloud storage, and running probe queries to monitor query service health.
| Metric | Type | Unit | Description |
|---|---|---|---|
| bzrk.janitor.segment_count | gauge | — | Current number of segments in the cluster |
| bzrk.janitor.total_data_size | gauge | bytes | Total size of all segment data in cloud storage |
| bzrk.janitor.segments_deleted | counter | — | Total segments deleted from cloud storage |
| bzrk.janitor.merge_cycle_duration | histogram | ms | Duration of segment merge cycles |
| bzrk.janitor.merge_failures | counter | — | Total failed merge cycles |
| bzrk.janitor.probe_duration | histogram | ms | Duration of probe query executions |
| bzrk.janitor.probes_completed | counter | — | Total probe queries that completed successfully. Used as the canonical 'query service is reachable' signal: the rate_below alert below fires when the count stops arriving, which only happens if the query service is genuinely unavailable or the janitor itself is stuck. See .claude/skills/berserk-observability/references/alert-framework.md for the canary design. |
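
The canary idea behind bzrk.janitor.probes_completed can be sketched as a rate check over counter samples. This is a minimal illustration, not the actual alert definition: the function name, the window, and the rate floor are assumptions, and the real rate_below alert is defined in the alert framework and may behave differently.

```python
from typing import Sequence, Tuple

# (unix_timestamp_seconds, cumulative_counter_value), sorted by timestamp
Sample = Tuple[float, float]


def rate_below(samples: Sequence[Sample], window_s: float, now: float,
               min_rate_per_s: float) -> bool:
    """Return True when the counter's rate over the window is below the floor.

    Hypothetical sketch. Missing data counts as a zero rate, so the check
    also fires when samples stop arriving altogether.
    """
    recent = [s for s in samples if now - window_s <= s[0] <= now]
    if len(recent) < 2:
        return True  # too few samples in the window: treat as "stopped arriving"
    (t0, v0), (t1, v1) = recent[0], recent[-1]
    if t1 <= t0:
        return True
    # A counter reset (v1 < v0) is treated as zero progress in this sketch.
    rate = max(v1 - v0, 0.0) / (t1 - t0)
    return rate < min_rate_per_s


healthy = [(0, 10), (60, 12), (120, 14)]   # probes keep completing
stalled = [(0, 10), (60, 10), (120, 10)]   # counter is flat
print(rate_below(healthy, 300, 120, 0.01))  # False
print(rate_below(stalled, 300, 120, 0.01))  # True
print(rate_below([], 300, 120, 0.01))       # True
```

Treating absent samples the same as a flat counter is what lets a single check cover both a stuck janitor and an unavailable query service.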
Nursery
Ingestion service that receives OpenTelemetry data from the collector, converts it into segments, and manages segment merging for optimal query performance.
| Metric | Type | Unit | Description |
|---|---|---|---|
| bzrk.nursery.streams_active | up_down_counter | — | Number of currently active stream followers |
| bzrk.nursery.ingest_lag_seconds | gauge | s | Lag of the most-stale active stream (seconds since its last ingest_time) |
| bzrk.nursery.download_duration_ms | histogram | ms | S3 segment download duration |
| bzrk.nursery.conversion_duration_ms | histogram | ms | Protobuf to segment conversion duration |
| bzrk.nursery.total_duration_ms | histogram | ms | Total segment processing duration (download + conversion) |
| bzrk.nursery.bytes_ingested | counter | By | Total compressed bytes downloaded from S3 (use rate() for throughput) |
| bzrk.nursery.bytes_ingested_uncompressed | counter | By | Total uncompressed proto bytes ingested (use rate() for throughput) |
| bzrk.nursery.segment_output_bytes | counter | By | Total bytes of segment files produced (use rate() for throughput) |
| bzrk.nursery.data_errors | counter | — | Data errors (malformed protobuf, conversion failures) |
| bzrk.nursery.infra_errors | counter | — | Infrastructure errors (S3 failures, I/O errors) |
| bzrk.nursery.active_streams | gauge | — | Number of active streams reported by Meta |
| bzrk.nursery.closed_streams | gauge | — | Number of closed streams reported by Meta |
| bzrk.nursery.merge_count | counter | — | Total number of completed merges |
| bzrk.nursery.merge_inputs | histogram | segments | Ingest segments consumed by one baby-segment merge. The p50/p99 values indicate the fan-in to expect at the nursery merge span links. |
| bzrk.nursery.merge_output_size_mb | histogram | MB | Compressed output size of merged segments |
| bzrk.nursery.merge_duration | histogram | ms | Duration of segment merge operations |
| bzrk.nursery.merge_speed_mbps | histogram | MB/s | Merge throughput in megabytes per second |
| bzrk.nursery.oldest_unmerged_data_age_seconds | gauge | s | Age of the oldest unmerged baby segment in seconds |
| bzrk.nursery.events_ingested | counter | — | Total events ingested across all streams |
| bzrk.nursery.ingest_delay | histogram | ms | Delay between event timestamp and ingest time |
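
Several of the byte counters above say to use rate() for throughput. As a rough sketch of what that means, the per-second rate is the increase of the cumulative counter divided by the elapsed time. The helper name `counter_rate`, the reset handling, and the sample values are assumptions made for illustration; the rate() function in your query tool may treat resets and windows differently.

```python
def counter_rate(v_start: float, v_end: float, elapsed_s: float) -> float:
    """Per-second increase of a cumulative counter between two samples.

    Hypothetical helper for illustration only.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    if v_end < v_start:
        # Assumed reset handling: the counter restarted from zero,
        # so only the post-reset value counts as progress.
        return v_end / elapsed_s
    return (v_end - v_start) / elapsed_s


# Made-up samples taken 60 seconds apart.
compressed = counter_rate(1_000_000, 7_000_000, 60)     # bzrk.nursery.bytes_ingested
uncompressed = counter_rate(4_000_000, 34_000_000, 60)  # bzrk.nursery.bytes_ingested_uncompressed

print(compressed)                  # 100000.0 bytes/s
print(uncompressed)                # 500000.0 bytes/s
print(uncompressed / compressed)   # 5.0, the compression ratio for this interval
```

Dividing the uncompressed rate by the compressed rate gives a compression ratio over the same interval, which is one reason both counters are worth charting together.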
Query
Query execution service that receives KQL queries over HTTP and gRPC, plans and executes them against segments, and streams results back to clients.
| Metric | Type | Unit | Description |
|---|---|---|---|
| bzrk.query.execution_duration | histogram | ms | End-to-end query execution duration |
| bzrk.query.requests | counter | — | Total query requests received |
| bzrk.query.result_rows | histogram | — | Number of rows returned per query |
| bzrk.query.errors | counter | — | Total query errors by error type |
| bzrk.query.open_fds | gauge | — | bzrk_lib::count_open_fds() periodic sample (10s interval). Pair with bzrk.query.fd_limit to compute open_fds / fd_limit on dashboards/alerts without joining against startup logs. apps/query in cache_mode=remote holds a UDS connection per worker task plus the SCM_RIGHTS cache_fd + shm_fd passed by cache_server, so the count tracks engine concurrency directly. Symmetric with bzrk.cache_server.open_fds. |
| bzrk.query.fd_limit | gauge | — | Current RLIMIT_NOFILE soft cap. Companion to open_fds, sampled on the same 10s tick so dashboards can show "fds: N / LIMIT (X%)" and alerts can fire on open_fds / fd_limit > 0.8 before saturation. Production binaries raise the soft limit to the hard cap at startup, so this is effectively static; emitting it as a gauge keeps the query simple. |
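
The open_fds / fd_limit pairing described above can be sketched as follows. The 0.8 threshold and the "fds: N / LIMIT (X%)" label come from the metric descriptions; the function name `fd_status` and the sample values are hypothetical.

```python
def fd_status(open_fds: int, fd_limit: int, threshold: float = 0.8) -> tuple[str, bool]:
    """Build a dashboard label and an alert flag from the two gauges.

    Hypothetical helper: open_fds and fd_limit stand for same-tick samples
    of bzrk.query.open_fds and bzrk.query.fd_limit.
    """
    if fd_limit <= 0:
        raise ValueError("fd_limit must be positive")
    ratio = open_fds / fd_limit
    label = f"fds: {open_fds} / {fd_limit} ({ratio:.0%})"
    return label, ratio > threshold


print(fd_status(512, 1024))  # ('fds: 512 / 1024 (50%)', False)
print(fd_status(900, 1024))  # ('fds: 900 / 1024 (88%)', True)
```

Because both gauges are sampled on the same tick, the ratio needs no join against startup logs; the same calculation would apply to the cache server's descriptor gauge if a matching limit is available there.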
UI
Web UI for querying Berserk.
| Metric | Type | Unit | Description |
|---|---|---|---|
| bzrk.ui.query_duration | histogram | ms | Duration of proxied queries from start to stream completion |
| bzrk.ui.site_visits | counter | — | Number of page visits to the UI |
| bzrk.ui.browser_span_duration | histogram | ms | Duration of spans reported by the browser via /api/telemetry/spans |