Alerts & Runbooks¶
genai-otel-bridge ships eleven self-observability alert rules under
deploy/grafana/self-obs/alertrule-*.yaml. All rules query genai_otel_bridge_* metrics
directly (no recording-rule dependency) and use noDataState: Ok so the healthy case (no
series at all) never fires spuriously.
Push the entire self-obs/ directory to apply them:
Alert summary¶
| Alert | Severity | When it fires |
|---|---|---|
| GenaiOtelBridgeLeaderAbsent | critical | No successful emit from any replica in 15m |
| GenaiOtelBridgePollerStale | warning | A loop is staler than 2× its own 6h baseline AND > 300s |
| GenaiOtelBridgeEmitFailing | critical | Fatal emit errors in the last 10m |
| GenaiOtelBridgeAuthErrors | critical | Upstream returned 401/403 (credential failure) |
| GenaiOtelBridgeUpstreamErrorBudget | warning | > 20% of requests to an upstream target are errors |
| GenaiOtelBridgeWindowTruncatedDroppingRecords | warning | A log loop truncated a window; records were dropped |
| GenaiOtelBridgeDataLoss | warning | Samples skipped for a real-loss reason (too_old / payload_too_large / duplicate_timestamp) |
| GenaiOtelBridgeBucketRevisedAfterSettle | warning | > 30 settled buckets/h still changing after bucket_settle |
| GenaiOtelBridgeQueueBackpressure | warning | Emit cannot keep up with collect (queue depth > 0 for 15m) |
| GenaiOtelBridgeCardinalitySpike | warning | Sustained creation of new label-value combinations on a series |
| GenaiOtelBridgeNoStandby | warning | Fewer than 2 replicas self-reporting |
GenaiOtelBridgeLeaderAbsent and GenaiOtelBridgePollerStale are complementary:
absent means no series at all (leader gone); stale means the series is present but
ageing past its own normal baseline (leader wedged or loop stuck).
Runbooks¶
GenaiOtelBridgeLeaderAbsent¶
Severity: critical
Fires when: absent_over_time(genai_otel_bridge_last_success_timestamp_seconds[15m]) is true
for 10 minutes — no replica has successfully emitted in 15 minutes.
What to check:
- Check that the Deployment is running: `kubectl get pods -l app=genai-otel-bridge`
- Check pod logs for startup errors: `kubectl logs -l app=genai-otel-bridge`
- Check that the Kubernetes Lease exists: `kubectl get lease genai-otel-bridge-leader`
- Check OTLP egress: the pod must reach the configured OTLP endpoint on port 443. If `genai_otel_bridge_emit_errors_total` is non-zero, the source APIs may be reachable but the OTLP endpoint is blocked.
- Check RBAC: the pod's ServiceAccount must be able to `get`/`update` the Lease and the `genai-otel-bridge-checkpoints` ConfigMap.
Note: this alert triggers on absence alone; it does not require the loop to have been running previously. On a fresh deployment it stays active until the first successful emit.
See also: Troubleshooting, High Availability.
GenaiOtelBridgePollerStale¶
Severity: warning
Fires when: a loop's last-success age exceeds **both 2× its own trailing-6h p90 baseline and
300 seconds** for 15 minutes.
This rule is self-relative — each loop's threshold is derived from its own recent behaviour.
Log-export loops (logs_export, runs) legitimately take tens of minutes per cycle; the
self-relative threshold avoids false positives on those loops while remaining sensitive to
a genuinely stuck snapshot loop (sessions).
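The self-relative condition can be sketched in a few lines of Python. This is an illustration of the logic described above, not the alert rule itself: the function name `is_stale`, its parameters, and the sample ages are hypothetical, and the 15-minute hold period is not modelled.

```python
from statistics import quantiles


def is_stale(current_age_s, baseline_ages_s, multiplier=2.0, floor_s=300.0):
    """Return True when a loop's last-success age breaches both conditions.

    baseline_ages_s stands in for the loop's trailing-6h age samples
    (hypothetical input; at least two samples are needed).
    """
    # quantiles(n=10) returns nine cut points; the last one is the p90.
    p90 = quantiles(baseline_ages_s, n=10, method="inclusive")[-1]
    # Both the self-relative threshold and the absolute floor must be exceeded.
    return current_age_s > multiplier * p90 and current_age_s > floor_s


# A snapshot loop that normally succeeds every ~25s (assumed values).
snapshot_baseline = [20, 25, 30, 28, 22, 26, 24, 27, 29, 23]
# A log-export loop whose cycles take tens of minutes (assumed values).
export_baseline = [1200, 1500, 1800, 1400, 1600]

print(is_stale(400, snapshot_baseline))   # well past 2x baseline and the floor
print(is_stale(100, snapshot_baseline))   # past 2x baseline but under the floor
print(is_stale(2000, export_baseline))    # long, but normal for this loop
```

The 300-second floor keeps fast loops from alerting on small absolute delays, while the 2× multiplier keeps slow loops from alerting on their normal cycle time.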
What to check:
- Which loop is stale? Check `genai_otel_bridge_window_lag_seconds` labelled by `loop`.
- Check `genai_otel_bridge_emit_errors_total` for that loop: failed emits prevent watermark advancement.
- Check `genai_otel_bridge_upstream_request_duration_seconds`: slow source API responses increase collect time, which can cause the window lag to grow.
- Check `genai_otel_bridge_queue_depth`: a full queue blocks collection.
See also: Troubleshooting — stale watermark.
GenaiOtelBridgeEmitFailing¶
Severity: critical
Fires when: fatal emit errors (retryable_exhausted, checkpoint_*, bad_encoding) appear
in the last 10 minutes. Benign upstream-collect retries are excluded.
What to check:
- Check pod logs for OTLP error messages.
- Check that the OTLP endpoint is reachable and accepting data.
- If `bad_encoding` errors appear, this indicates a bug in the OTLP encoder; file an issue.
- If `checkpoint_*` errors appear, check ConfigMap RBAC and whether the ConfigMap is corrupt.
See also: Troubleshooting.
GenaiOtelBridgeAuthErrors¶
Severity: critical
Fires when: increase(genai_otel_bridge_auth_errors_total[10m]) > 0 — the upstream source API
returned a 401 or 403 response.
What to check:
- Check which `source` label is on the metric; it identifies which source's credential is failing.
- Verify the API key / secret for that source is correct and not expired.
- Check that the Kubernetes Secret or environment variable containing the credential is mounted correctly in the pod.
- If the error started recently after a deployment, check for a config change that may have altered the credential reference.
See also: Troubleshooting — auth errors.
GenaiOtelBridgeUpstreamErrorBudget¶
Severity: warning
Fires when: more than 20% of requests to an upstream target are 4xx/5xx or errors
(including timeouts) over 10 minutes.
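As a rough illustration of the 20% budget, the sketch below computes a per-target error ratio from a list of request outcomes. The function name, the target names, and the `status_class` values (`"2xx"`, `"4xx"`, `"5xx"`, `"error"`) are assumptions made for the example; the real label values may differ.

```python
from collections import defaultdict

# Assumed status_class values that count against the budget.
ERROR_CLASSES = {"4xx", "5xx", "error"}


def targets_over_budget(requests, budget=0.20):
    """Return the targets whose error ratio is strictly above the budget.

    requests is an iterable of (target, status_class) pairs observed
    in the evaluation window.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for target, status_class in requests:
        totals[target] += 1
        if status_class in ERROR_CLASSES:
            errors[target] += 1
    return sorted(t for t in totals if errors[t] / totals[t] > budget)


window = (
    [("target-a", "2xx")] * 8 + [("target-a", "5xx")] * 2    # exactly 20%
    + [("target-b", "2xx")] * 7 + [("target-b", "error")] * 3  # 30%
)
print(targets_over_budget(window))
```

Because the comparison is strict, a target sitting at exactly 20% errors does not breach the budget in this sketch.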
What to check:
- Check `genai_otel_bridge_upstream_request_duration_seconds` labelled by `target` and `status_class` for the error distribution.
- Distinguish between 401/403 (credential failure, so `GenaiOtelBridgeAuthErrors` fires too), 429 (quota exceeded; the loop backs off automatically), and 5xx (upstream outage).
- Check the upstream platform status page for incidents.
- If the errors are sustained, check that the configured `base_url` is correct.
GenaiOtelBridgeWindowTruncatedDroppingRecords¶
Severity: warning
Fires when: a windowed log loop (runs, logs_export) truncated a window — it advanced
past undrained records with a counted gap. Some log records were dropped.
Query: sum by (loop) (increase(genai_otel_bridge_source_graph_unavailable_total{graph="window_truncated"}[10m])) > 0
The truncated count is unknowable by construction (the loop stops at the page cap).
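A minimal sketch of a page-capped drain shows why the dropped count cannot be known. It assumes a hypothetical cursor-based `fetch_page` callable and is not the bridge's actual collection code.

```python
def drain_window(fetch_page, max_pages):
    """Fetch up to max_pages pages for one window.

    fetch_page(cursor) is a hypothetical callable that returns
    (records, next_cursor), with next_cursor set to None on the last page.
    Returns (records, truncated).
    """
    records = []
    cursor = None
    for _ in range(max_pages):
        page, cursor = fetch_page(cursor)
        records.extend(page)
        if cursor is None:
            # The window was fully drained within the page cap.
            return records, False
    # The cap was reached with a cursor still outstanding. The remaining
    # pages are never requested, so their record count is never observed.
    return records, True


# Fake three-page source used only for demonstration.
PAGES = [["r1", "r2"], ["r3", "r4"], ["r5"]]


def fake_fetch(cursor):
    index = 0 if cursor is None else cursor
    next_cursor = index + 1 if index + 1 < len(PAGES) else None
    return PAGES[index], next_cursor


print(drain_window(fake_fetch, max_pages=5))
print(drain_window(fake_fetch, max_pages=2))
```

In the second call the loop stops after two pages with the truncation flag set, and nothing in its state says how many records the third page held.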
What to check:
- Which loop truncated? The `loop` label identifies it.
- If `logs_export` is truncating: increase `settings.max_pages_per_window` or decrease `settings.window` to reduce the volume per window.
- If `runs` is truncating: increase `settings.max_pages_per_window` or narrow the scope via `settings.session_filter`.
GenaiOtelBridgeDataLoss¶
Severity: warning
Fires when: samples are being skipped for a real-loss reason: too_old (sample outside
Mimir's accept window), payload_too_large (413 on a minimal chunk), or
duplicate_timestamp. Benign reasons are excluded.
What to check:
- `too_old`: the sample is outside Mimir's `out_of_order_time_window`. Either the loop's `max_backfill` exceeds the Mimir window, or the stack's OOO window is too small for the intended max downtime. Request Grafana Support to raise it (GS2).
- `payload_too_large`: the minimum emit chunk exceeds the gateway's payload limit. Reduce `chunk_max_records` on the relevant logs loop.
- `duplicate_timestamp`: two sources are writing to the same `(series, timestamp)`. Check for overlapping series names across sources; this is caught at startup but can appear after config changes.
GenaiOtelBridgeBucketRevisedAfterSettle¶
Severity: warning
Fires when: more than 30 settled buckets per hour are still changing value after
bucket_settle — the late-arrival lag for the analytics loop exceeds the current setting.
What to check:
- Check the `genai_otel_bridge_bucket_revised_after_settle_age_seconds` histogram for the p95 of observed revision ages. Set `bucket_settle` to at least the p95 value.
- Widening `bucket_settle` delays the reporting horizon slightly but eliminates under-counting. Metrics cannot be re-emitted once settled.
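To make the p95 step concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation within the matching bucket, the same general approach as PromQL's `histogram_quantile`. The function name, bucket bounds, and counts are invented for illustration; read the real values from the histogram.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with an infinite bound,
    mirroring the `le` buckets of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # No finite upper edge: fall back to the last finite bound.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical revision-age buckets in seconds (cumulative counts).
revision_ages = [(60, 50), (300, 90), (900, 98), (3600, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, revision_ages)
print(f"set bucket_settle to at least {p95:.0f}s")
```

With these made-up buckets the estimate lands inside the 300s to 900s bucket, so a `bucket_settle` of five minutes would be too short and fifteen minutes would cover it.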
See also: Portkey — bucket settle.
GenaiOtelBridgeQueueBackpressure¶
Severity: warning
Fires when: genai_otel_bridge_queue_depth_ratio > 0 is sustained for 15 minutes — the
emit pipeline cannot drain the queue as fast as collection enqueues batches.
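The "sustained for 15 minutes" part can be pictured as a hold timer that resets whenever the queue drains. The sketch below is a simplified model of that behaviour over a list of samples; the function name and sample layout are assumptions, and it is not how the alerting engine is implemented.

```python
def sustained_backpressure(samples, hold_s=900):
    """Return True if the ratio has stayed above zero for at least hold_s.

    samples is a time-ordered list of (timestamp_s, queue_depth_ratio)
    pairs (hypothetical input), evaluated at the latest sample.
    """
    breach_start = None
    for ts, ratio in samples:
        if ratio > 0:
            if breach_start is None:
                breach_start = ts   # breach begins
        else:
            breach_start = None     # queue drained: the hold timer resets
    return breach_start is not None and samples[-1][0] - breach_start >= hold_s


# One sample per minute for 15 minutes.
always_backed_up = [(t, 0.4) for t in range(0, 901, 60)]
drained_once = [(t, 0.0 if t == 300 else 0.4) for t in range(0, 901, 60)]

print(sustained_backpressure(always_backed_up))
print(sustained_backpressure(drained_once))
```

A single drained sample is enough to restart the timer, which is why brief bursts of queueing do not fire this alert.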
What to check:
- Check `genai_otel_bridge_emit_errors_total`: repeated emit errors (OTLP 5xx / timeout) fill the queue while workers are busy retrying.
- Check the OTLP endpoint latency: slow responses reduce emit throughput.
- Under sustained outage, `window_lag` rises and `GenaiOtelBridgePollerStale` will eventually fire. On recovery, the loop resumes from its watermark, bounded by `max_backfill`.
GenaiOtelBridgeCardinalitySpike¶
Severity: warning
Fires when: new label-value combinations are being created at a sustained rate on a series — a cardinality early warning.
What to check:
- Check `genai_otel_bridge_new_label_values_total` labelled by `series` to identify which metric is growing.
- Check whether a high-cardinality field was recently added to `governance.allow_label_keys` or `settings.extra_indexed_fields`.
- For the LangSmith `session` label: check that `session_filter` is set to bound the number of sessions in scope.
- Review `governance.per_metric_cardinality_budget`: the guard drops over-budget series (counted via `genai_otel_bridge_guard_dropped_total`).
See also: Content Governance.
GenaiOtelBridgeNoStandby¶
Severity: warning
Fires when: fewer than 2 replicas are self-reporting (i.e. only 1 replica visible, or none).
This alert is expected to fire on intentionally single-replica dev stacks. In production, fewer than 2 replicas means there is no failover headroom.
What to check:
- Check the Deployment replica count: `kubectl get deploy genai-otel-bridge`
- Check for pending or crash-looping pods: `kubectl get pods -l app=genai-otel-bridge`
- On multi-AZ deployments, check that the pod topology spread constraints are satisfiable (the chart uses `ScheduleAnyway`, so this should not block scheduling, but it is worth verifying).
See also¶
- Dashboards: the self-obs dashboard and recording rules
- Telemetry reference: `genai_otel_bridge_*` metric definitions
- Troubleshooting: detailed failure mode guidance
- High Availability: leader election and failover