Berserk Docs

Service Metrics

OpenTelemetry metrics emitted by Berserk services

All Berserk services emit metrics via OpenTelemetry. Metrics are exported over OTLP to the configured collector endpoint and can be queried in Berserk itself.

Each metric name is prefixed with bzrk. followed by the service scope (e.g. bzrk.ui.query_duration).
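
The naming convention above can be sketched as a pair of helpers. This is only an illustration: `metric_name` and `split_metric_name` are hypothetical names, not part of any Berserk library, and the sketch assumes the service scope itself never contains a dot.

```python
PREFIX = "bzrk"  # common prefix shared by every Berserk metric


def metric_name(service: str, metric: str) -> str:
    """Build a full metric name such as 'bzrk.ui.query_duration'."""
    return f"{PREFIX}.{service}.{metric}"


def split_metric_name(name: str) -> tuple[str, str]:
    """Split a full metric name into (service scope, metric).

    Raises ValueError if the name has fewer than three dot-separated
    parts or does not start with the 'bzrk' prefix.
    """
    prefix, service, metric = name.split(".", 2)
    if prefix != PREFIX:
        raise ValueError(f"not a Berserk metric: {name!r}")
    return service, metric


if __name__ == "__main__":
    print(metric_name("ui", "query_duration"))
    print(split_metric_name("bzrk.nursery.ingest_lag_seconds"))
```

Splitting on the first two dots only means the service scope can be used directly as a grouping key on dashboards.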

A pre-built Grafana dashboard is available for download: bzrk-service-metrics.json. Import it into Grafana and select your Berserk datasource to visualize all metrics below.

Ingest

Simplified OTLP ingest service that receives traces, metrics, and logs over HTTP/gRPC and uploads to S3 via ingest_client.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| bzrk.ingest.queue_rejections | counter | | Total requests rejected due to admission control (semaphore exhaustion or dead stream actor) |
| bzrk.ingest.batch_flush_duration | histogram | ms | Duration of batch flush operations (S3 upload latency) |
| bzrk.ingest.batch_inputs | histogram | items | Incoming OTLP requests coalesced into one S3 batch flush. p50/p99 sizes the fan-in expected at the batch upload span links. |
| bzrk.ingest.data_dropped | counter | | Total requests dropped due to missing ingest token |
| bzrk.ingest.time_since_last_upload_seconds | gauge | s | Worst-silent stream's seconds since last successful S3 upload |
| bzrk.ingest.inflight_requests | gauge | | Admission permits in use across HTTP/gRPC/Loki transports |
| bzrk.ingest.buffer_bytes | gauge | bytes | Bytes buffered across stream actors (pod-level SUM) |
| bzrk.ingest.phantom_write_retries | counter | | Total retry attempts after an uncertain upload (held by PendingRetry) |
| bzrk.ingest.phantom_write_retry_budget_exhausted | counter | | Total times a PendingRetry exhausted its budget without resolving (fell back to retryable error) |
| bzrk.ingest.invalid_otlp_total | counter | | OTLP payloads rejected by the fast-path validator. Attributes: signal (traces |
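
The p50/p99 readings mentioned for bzrk.ingest.batch_inputs have to be estimated from histogram buckets. The sketch below shows one common way to do that, linear interpolation inside the bucket that contains the target rank. It assumes an explicit-bucket histogram with non-negative values; the bucket bounds and counts are made up for illustration, and nothing here is confirmed to match how Berserk computes percentiles.

```python
def estimate_quantile(q, bounds, counts):
    """Estimate quantile q (0..1) from an explicit-bucket histogram.

    bounds: ascending upper bounds of the finite buckets.
    counts: per-bucket counts; len(bounds) + 1 entries, the last one
            being the overflow bucket above the highest bound.
    Returns None for an empty histogram.
    """
    if not bounds or len(counts) != len(bounds) + 1:
        raise ValueError("counts must have exactly one more entry than bounds")
    total = sum(counts)
    if total == 0:
        return None
    target = q * total
    cumulative = 0
    lower = 0.0  # assumption: observed values are non-negative
    for upper, count in zip(bounds, counts):
        if count > 0 and cumulative + count >= target:
            # Interpolate linearly within the bucket holding the target rank.
            fraction = (target - cumulative) / count
            return lower + (upper - lower) * fraction
        cumulative += count
        lower = upper
    # Target rank falls in the overflow bucket: report the highest finite bound.
    return float(bounds[-1])


if __name__ == "__main__":
    bounds = [1, 2, 4, 8, 16]          # hypothetical bucket bounds (items)
    counts = [10, 20, 40, 20, 8, 2]    # hypothetical observations per bucket
    print(estimate_quantile(0.50, bounds, counts))
    print(estimate_quantile(0.99, bounds, counts))
```

The estimate is only as fine as the bucket layout, so a p99 that lands in the overflow bucket is a lower bound rather than a measurement.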

Janitor

Background service responsible for segment lifecycle management: merging small segments into larger ones, deleting tombstoned segments from cloud storage, and running probe queries to monitor query service health.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| bzrk.janitor.segment_count | gauge | | Current number of segments in the cluster |
| bzrk.janitor.total_data_size | gauge | bytes | Total size of all segment data in cloud storage |
| bzrk.janitor.segments_deleted | counter | | Total segments deleted from cloud storage |
| bzrk.janitor.merge_cycle_duration | histogram | ms | Duration of segment merge cycles |
| bzrk.janitor.merge_failures | counter | | Total failed merge cycles |
| bzrk.janitor.probe_duration | histogram | ms | Duration of probe query executions |
| bzrk.janitor.probes_completed | counter | | Total probe queries that completed successfully. Used as the canonical 'query service is reachable' signal: the rate_below alert below fires when the count stops arriving, which only happens if the query service is genuinely unavailable or the janitor itself is stuck. See .claude/skills/berserk-observability/references/alert-framework.md for the canary design. |
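
To make the canary idea concrete, here is a minimal sketch of what a rate_below style check on bzrk.janitor.probes_completed could look like. The function signature, window, and threshold are assumptions for illustration; the actual alert framework is not shown on this page, and counter resets are not handled here.

```python
def rate_below(samples, window_s, min_rate, now):
    """Return True when a rate_below style alert should fire.

    samples:  list of (timestamp_s, cumulative_count), ascending by time.
    window_s: look-back window in seconds.
    min_rate: minimum acceptable rate in events per second.
    now:      evaluation time in seconds (same clock as the samples).
    """
    recent = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(recent) < 2:
        # The count stopped arriving: treat missing data as a failure.
        return True
    (t0, v0), (t1, v1) = recent[0], recent[-1]
    if t1 == t0:
        return True
    rate = (v1 - v0) / (t1 - t0)
    return rate < min_rate


if __name__ == "__main__":
    healthy = [(0, 10), (60, 12), (120, 14)]  # hypothetical probe counts
    print(rate_below(healthy, window_s=300, min_rate=0.01, now=130))
    print(rate_below(healthy, window_s=300, min_rate=0.01, now=1000))
```

Treating an empty window as a firing condition is the point of the design: a stuck janitor and an unreachable query service both show up as silence rather than as an explicit error.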

Nursery

Ingestion service that receives OpenTelemetry data from the collector, converts it into segments, and manages segment merging for optimal query performance.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| bzrk.nursery.streams_active | up_down_counter | | Number of currently active stream followers |
| bzrk.nursery.ingest_lag_seconds | gauge | s | Lag of the most-stale active stream (seconds since its last ingest_time) |
| bzrk.nursery.download_duration_ms | histogram | ms | S3 segment download duration |
| bzrk.nursery.conversion_duration_ms | histogram | ms | Protobuf to segment conversion duration |
| bzrk.nursery.total_duration_ms | histogram | ms | Total segment processing duration (download + conversion) |
| bzrk.nursery.bytes_ingested | counter | By | Total compressed bytes downloaded from S3 (use rate() for throughput) |
| bzrk.nursery.bytes_ingested_uncompressed | counter | By | Total uncompressed proto bytes ingested (use rate() for throughput) |
| bzrk.nursery.segment_output_bytes | counter | By | Total bytes of segment files produced (use rate() for throughput) |
| bzrk.nursery.data_errors | counter | | Data errors (malformed protobuf, conversion failures) |
| bzrk.nursery.infra_errors | counter | | Infrastructure errors (S3 failures, I/O errors) |
| bzrk.nursery.active_streams | gauge | | Number of active streams reported by Meta |
| bzrk.nursery.closed_streams | gauge | | Number of closed streams reported by Meta |
| bzrk.nursery.merge_count | counter | | Total number of completed merges |
| bzrk.nursery.merge_inputs | histogram | segments | Ingest segments consumed by one baby-segment merge. p50/p99 sizes the fan-in expected at the nursery merge span links. |
| bzrk.nursery.merge_output_size_mb | histogram | MB | Compressed output size of merged segments |
| bzrk.nursery.merge_duration | histogram | ms | Duration of segment merge operations |
| bzrk.nursery.merge_speed_mbps | histogram | MB/s | Merge throughput in megabytes per second |
| bzrk.nursery.oldest_unmerged_data_age_seconds | gauge | s | Age of the oldest unmerged baby segment in seconds |
| bzrk.nursery.events_ingested | counter | | Total events ingested across all streams |
| bzrk.nursery.ingest_delay | histogram | ms | Delay between event timestamp and ingest time |
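
The byte counters above are cumulative, which is why their descriptions suggest rate() for throughput. As a rough illustration of what such a rate amounts to, the sketch below derives a per-second value from two samples of a cumulative counter. The helper name and the sample values are hypothetical, and the reset handling (counting from zero after a drop) is one common convention, not necessarily what Berserk's rate() does.

```python
def counter_rate(prev, curr):
    """Per-second rate between two (timestamp_s, cumulative_value) samples."""
    (t0, v0), (t1, v1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in strictly increasing time order")
    delta = v1 - v0
    if delta < 0:
        # Counter went backwards (e.g. process restart): count from zero.
        delta = v1
    return delta / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical samples of bzrk.nursery.bytes_ingested and its
    # uncompressed counterpart, taken 10 seconds apart.
    compressed = counter_rate((0, 1_000_000), (10, 6_000_000))
    uncompressed = counter_rate((0, 4_000_000), (10, 24_000_000))
    print(f"throughput: {compressed:.0f} B/s compressed")
    print(f"compression ratio: {uncompressed / compressed:.1f}x")
```

Dividing the uncompressed rate by the compressed rate over the same interval gives an approximate compression ratio without needing a dedicated metric.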

Query

Query execution service that receives KQL queries over HTTP and gRPC, plans and executes them against segments, and streams results back to clients.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| bzrk.query.execution_duration | histogram | ms | End-to-end query execution duration |
| bzrk.query.requests | counter | | Total query requests received |
| bzrk.query.result_rows | histogram | | Number of rows returned per query |
| bzrk.query.errors | counter | | Total query errors by error type |
| bzrk.query.open_fds | gauge | | bzrk_lib::count_open_fds() periodic sample (10s interval). Pair with bzrk.query.fd_limit to compute open_fds / fd_limit on dashboards/alerts without joining against startup logs. apps/query in cache_mode=remote holds a UDS connection per worker task plus the SCM_RIGHTS cache_fd + shm_fd passed by cache_server, so the count tracks engine concurrency directly. Symmetric with bzrk.cache_server.open_fds. |
| bzrk.query.fd_limit | gauge | | Current RLIMIT_NOFILE soft cap. Companion to open_fds sampled on the same 10s tick so dashboards can show "fds: N / LIMIT (X%)" and alerts can fire on open_fds / fd_limit > 0.8 before saturation. Production binaries raise the soft limit to the hard cap at startup, so this is effectively static; emitting it as a gauge keeps the query simple. |
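
The open_fds / fd_limit pairing can be sketched as follows. The function names are hypothetical and the sample values are invented; only the panel format "fds: N / LIMIT (X%)" and the 0.8 threshold come from the metric descriptions.

```python
def fd_utilization(open_fds: int, fd_limit: int) -> float:
    """Fraction of the file-descriptor soft limit currently in use."""
    if fd_limit <= 0:
        raise ValueError("fd_limit must be positive")
    return open_fds / fd_limit


def format_fd_panel(open_fds: int, fd_limit: int) -> str:
    """Render the dashboard string 'fds: N / LIMIT (X%)'."""
    pct = fd_utilization(open_fds, fd_limit) * 100
    return f"fds: {open_fds} / {fd_limit} ({pct:.0f}%)"


def should_alert(open_fds: int, fd_limit: int, threshold: float = 0.8) -> bool:
    """Fire once utilization exceeds the threshold, before saturation."""
    return fd_utilization(open_fds, fd_limit) > threshold


if __name__ == "__main__":
    # Hypothetical samples of bzrk.query.open_fds and bzrk.query.fd_limit
    # taken on the same 10s tick.
    print(format_fd_panel(512, 1024))
    print(should_alert(900, 1024))
```

Because both gauges are sampled on the same tick, the ratio can be computed from a single query result without joining against startup logs.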

UI

Web UI for querying Berserk.

| Metric | Type | Unit | Description |
| --- | --- | --- | --- |
| bzrk.ui.query_duration | histogram | ms | Duration of proxied queries from start to stream completion |
| bzrk.ui.site_visits | counter | | Number of page visits to the UI |
| bzrk.ui.browser_span_duration | histogram | ms | Duration of spans reported by the browser via /api/telemetry/spans |
