Dashboards¶
genai-otel-bridge ships Grafana resources as gcx-native resource manifests under deploy/grafana/.
Resources are split by role — self-observability (the bridge's own health) and product
telemetry (Portkey, LangSmith signals). Push each role to the appropriate stack.
Layout¶
deploy/grafana/
├── self-obs/ # genai_otel_bridge_* signals — push to your self-obs stack
│ ├── folder.yaml
│ ├── dashboard-self-obs.yaml
│ ├── alertrule-*.yaml (11 alert rules)
│ └── recordingrule-*.yaml (7 recording rules)
└── product/ # portkey_api_* and langsmith_* signals — push to your product stack
├── folder.yaml
└── recordingrule-*.yaml (5 recording rules)
Applying resources¶
Resources are applied with the gcx CLI. The --context flag selects the target Grafana
stack. Substitute your own stack context names for the placeholders below:
# Self-obs rules, alerts, and dashboard → your self-obs stack
gcx resources push -p deploy/grafana/self-obs --context self-obs-stack
# Product recording rules → your product stack
gcx resources push -p deploy/grafana/product --context product-stack
Add --dry-run to preview changes without applying them.
Push the Folder before the Dashboard
The folder.yaml manifest creates the genai-otel-bridge folder. Push the whole
self-obs/ directory (not individual files) so the Folder is created before the
Dashboard — otherwise the Dashboard creation fails with a 404 on the missing folder.
To reconcile drift after out-of-band changes:
Self-observability dashboard¶
self-obs/dashboard-self-obs.yaml — genai-otel-bridge — self-observability
A tabbed, dynamic dashboard (v2 TabsLayout + responsive AutoGridLayout) covering the
bridge's own health across all signals. The manifest is generated from gen_dashboard.py;
edit the generator and run make gen-dashboard to regenerate it.
Dashboard tabs¶
| Tab | What it shows |
|---|---|
| Overview / SLO | At-a-glance badges: loops-healthy, leader present, replicas, worst freshness ratio, max window lag, fatal emit errors; freshness-by-loop + throughput |
| Liveness & leadership | Window lag, last-success age vs each loop's own baseline, replicas over time, per-loop freshness gauge (repeats per $loop) |
| Emit pipeline | Emitted samples/logs, emit errors by kind, queue depth, samples skipped/capped, guard dropped, buckets revised after settle |
| Upstream source health | Request rate/latency/error-ratio per target, auth errors, source-graph-unavailable |
| Cardinality & governance | New-label-value growth, guard drops, DPM capping |
| Logs | The poller's own stdout logs (not the high-volume product logs) |
| Profiling | The poller's own Pyroscope profiles: CPU, heap in-use, goroutines, CPU flame graph |
Self-relative freshness¶
The freshness panels colour on a self-relative staleness ratio (genai-otel-bridge:freshness_ratio)
rather than a flat threshold. Each loop's current staleness is divided by its own trailing-6h
p90 baseline: < 1.5 green, 1.5–2 yellow, > 2 red. This prevents false positives on the
log-export loops, which legitimately sawtooth to tens of minutes, while keeping full sensitivity
for fast snapshot loops. The freshness and upstream-ratio panels depend on the recording rules
being deployed.
"No data" on some panels¶
emitted, emitted_logs, last_success_timestamp, and window_lag are only recorded on
a successful emit (the watermark must leave zero). Panels for these metrics show "No data"
until the loop has committed at least once — this is a real signal that emit is failing, not
a broken panel.
Dashboard variables¶
| Variable | Datasource type | Default UID |
|---|---|---|
${datasource} |
Prometheus | grafanacloud-prom |
${loki} |
Loki | grafanacloud-logs |
${pyroscope} |
Pyroscope | grafanacloud-profiles |
${loop} |
— | multi-value loop filter |
The dashboard is stack-agnostic. Select the correct datasource UIDs for your stack when provisioning.
Self-obs recording rules¶
| Rule | What it computes |
|---|---|
genai-otel-bridge:last_success_age:seconds |
time() − max by (loop)(last_success_timestamp_seconds) — staleness in seconds per loop |
genai-otel-bridge:last_success_age:baseline6h |
Trailing 6h p90 of last_success_age:seconds per loop — the self-relative staleness baseline |
genai-otel-bridge:freshness_ratio |
last_success_age:seconds / last_success_age:baseline6h — the ratio the dashboard colours on |
genai-otel-bridge:upstream_error_ratio:5m |
Error ratio per upstream target — drives the upstream-health panel and GenaiOtelBridgeUpstreamErrorBudget alert |
genai-otel-bridge:window_truncated:rate5m |
Rate of window-truncation events per loop — drives GenaiOtelBridgeWindowTruncatedDroppingRecords |
genai-otel-bridge:scrape_healthy |
1 when the leader's last successful collect+emit is recent |
genai-otel-bridge:scrape_present |
1 if last_success_timestamp_seconds was exported in the last 15m |
Product recording rules¶
The product/ directory ships recording rules over portkey_api_* and langsmith_* signals.
These encode the correct query patterns for per-bucket gauges (using sum_over_time not
rate/increase).
| Rule | What it computes |
|---|---|
portkey:requests:sum_5m |
sum_over_time(portkey_api_requests[5m]) — total requests over 5m windows |
portkey:error_ratio:5m |
Error ratio derived from portkey_api_errors and portkey_api_requests |
langsmith:runs:sum_5m |
Total runs across all sessions summed per environment |
langsmith:cost_usd:sum_5m |
Total cost across sessions |
langsmith:tokens:sum_5m |
Total tokens across sessions |
Per-bucket gauge semantics
portkey_api_* metrics are per-bucket gauges, not counters. An instant query between
emit cycles may read as absent — use last_over_time(...[20m]) to see the last known value.
Always use sum_over_time to aggregate; never use rate() or increase() on these metrics.
genai_otel_bridge_source_graph_unavailable_total and
genai_otel_bridge_upstream_request_duration_seconds are counters — rate() is correct there.
Grafana-staff prerequisites¶
Before deploying to production:
- GS2 — raise Mimir
out_of_order_time_windowandreject_old_samples_max_ageto match your tolerated downtime SLA, or long-outage backfill will be rejected. - GS3 — exempt
genai_otel_bridge_*from Adaptive Metrics aggregation, or the staleness and error signals that detect poller failure will be rolled up and the health rules will become unreliable.
See also¶
- Alerts & Runbooks — the eleven bundled alert rules with runbooks
- Telemetry reference — full signal catalogue
- Troubleshooting — common failure modes