
Dashboards

genai-otel-bridge ships Grafana resources as gcx-native resource manifests under deploy/grafana/. Resources are split by role: self-observability (the bridge's own health) and product telemetry (Portkey and LangSmith signals). Push each role to its own stack, as shown under "Applying resources".


Layout

deploy/grafana/
├── self-obs/     # genai_otel_bridge_* signals — push to your self-obs stack
│   ├── folder.yaml
│   ├── dashboard-self-obs.yaml
│   ├── alertrule-*.yaml          (11 alert rules)
│   └── recordingrule-*.yaml      (7 recording rules)
└── product/      # portkey_api_* and langsmith_* signals — push to your product stack
    ├── folder.yaml
    └── recordingrule-*.yaml      (5 recording rules)

Applying resources

Resources are applied with the gcx CLI. The --context flag selects the target Grafana stack. Substitute your own stack context names for the placeholders below:

# Self-obs rules, alerts, and dashboard → your self-obs stack
gcx resources push -p deploy/grafana/self-obs --context self-obs-stack

# Product recording rules → your product stack
gcx resources push -p deploy/grafana/product --context product-stack

Add --dry-run to preview changes without applying them.
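
As a minimal sketch, the two pushes above could be wrapped in a small Python helper so that a dry run is the default. It uses only the flags shown in this section (-p, --context, --dry-run). The context names are the same placeholders as above, and the helper names (gcx_push_command, plan, TARGETS) are hypothetical and not part of gcx or this repository.

```python
import shlex
from typing import Dict, List, Tuple

# Role -> (manifest directory, stack context). The context names are
# placeholders; substitute the contexts configured for your own stacks.
TARGETS: Dict[str, Tuple[str, str]] = {
    "self-obs": ("deploy/grafana/self-obs", "self-obs-stack"),
    "product": ("deploy/grafana/product", "product-stack"),
}


def gcx_push_command(path: str, context: str, dry_run: bool = False) -> List[str]:
    """Build the argv for `gcx resources push` for one directory and context."""
    cmd = ["gcx", "resources", "push", "-p", path, "--context", context]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def plan(dry_run: bool = True) -> List[List[str]]:
    """Return one push command per role, in the order listed in TARGETS."""
    return [gcx_push_command(path, ctx, dry_run) for path, ctx in TARGETS.values()]


if __name__ == "__main__":
    # Print the commands rather than running them. To execute one, pass it to
    # subprocess.run(cmd, check=True).
    for cmd in plan():
        print(shlex.join(cmd))
```

Each command targets a whole role directory rather than individual files, so the role's folder.yaml is always pushed together with the resources that depend on it.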

Push the Folder before the Dashboard

The folder.yaml manifest creates the genai-otel-bridge folder. Push the whole self-obs/ directory (not individual files) so the Folder is created before the Dashboard — otherwise the Dashboard creation fails with a 404 on the missing folder.

To reconcile drift after out-of-band changes:

gcx resources pull <selector> -p deploy/grafana/self-obs -o yaml

Self-observability dashboard

File: self-obs/dashboard-self-obs.yaml
Title: genai-otel-bridge — self-observability

A tabbed, dynamic dashboard (v2 TabsLayout + responsive AutoGridLayout) covering the bridge's own health across all signals. The manifest is generated from gen_dashboard.py; edit the generator and run make gen-dashboard to regenerate it.

Dashboard tabs

| Tab | What it shows |
| --- | --- |
| Overview / SLO | At-a-glance badges: loops-healthy, leader present, replicas, worst freshness ratio, max window lag, fatal emit errors; freshness-by-loop + throughput |
| Liveness & leadership | Window lag, last-success age vs each loop's own baseline, replicas over time, per-loop freshness gauge (repeats per $loop) |
| Emit pipeline | Emitted samples/logs, emit errors by kind, queue depth, samples skipped/capped, guard dropped, buckets revised after settle |
| Upstream source health | Request rate/latency/error-ratio per target, auth errors, source-graph-unavailable |
| Cardinality & governance | New-label-value growth, guard drops, DPM capping |
| Logs | The poller's own stdout logs (not the high-volume product logs) |
| Profiling | The poller's own Pyroscope profiles: CPU, heap in-use, goroutines, CPU flame graph |

Self-relative freshness

The freshness panels colour on a self-relative staleness ratio (genai-otel-bridge:freshness_ratio) rather than a flat threshold. Each loop's current staleness is divided by its own trailing-6h p90 baseline: < 1.5 green, 1.5–2 yellow, > 2 red. This prevents false positives on the log-export loops, which legitimately sawtooth to tens of minutes, while keeping full sensitivity for fast snapshot loops. The freshness and upstream-ratio panels depend on the recording rules being deployed.
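
The colouring rule can be illustrated with a short Python sketch. The thresholds (1.5 and 2) come from the paragraph above. The function names and the example staleness and baseline values are assumptions made for illustration, and a ratio of exactly 1.5 or 2 is treated as yellow here, which the text does not specify.

```python
def freshness_ratio(staleness_s: float, baseline_p90_s: float) -> float:
    """Current staleness divided by the loop's own trailing p90 baseline."""
    if baseline_p90_s <= 0:
        raise ValueError("baseline must be positive")
    return staleness_s / baseline_p90_s


def freshness_colour(ratio: float) -> str:
    """Map a freshness ratio to the colour bands described above."""
    if ratio < 1.5:
        return "green"
    if ratio <= 2.0:
        return "yellow"
    return "red"


# Hypothetical log-export loop: 25 minutes stale, but its baseline is 20 minutes.
slow = freshness_ratio(1500.0, 1200.0)
# Hypothetical snapshot loop: only 150 s stale, but its baseline is 60 s.
fast = freshness_ratio(150.0, 60.0)

print(slow, freshness_colour(slow))
print(fast, freshness_colour(fast))
```

In this example a flat 5-minute threshold would flag the slow loop, which is behaving normally, and miss the fast loop, which is well outside its usual range. The ratio gets both cases right.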

"No data" on some panels

emitted, emitted_logs, last_success_timestamp, and window_lag are recorded only on a successful emit (the loop's watermark must have moved off zero). Panels for these metrics show "No data" until the loop has committed at least once. This is a real signal that emit is failing, not a broken panel.

Dashboard variables

| Variable | Datasource type | Default UID |
| --- | --- | --- |
| ${datasource} | Prometheus | grafanacloud-prom |
| ${loki} | Loki | grafanacloud-logs |
| ${pyroscope} | Pyroscope | grafanacloud-profiles |
| ${loop} | multi-value loop filter | |

The dashboard is stack-agnostic. Select the correct datasource UIDs for your stack when provisioning.


Self-obs recording rules

| Rule | What it computes |
| --- | --- |
| genai-otel-bridge:last_success_age:seconds | time() − max by (loop)(last_success_timestamp_seconds): staleness in seconds per loop |
| genai-otel-bridge:last_success_age:baseline6h | Trailing 6h p90 of last_success_age:seconds per loop: the self-relative staleness baseline |
| genai-otel-bridge:freshness_ratio | last_success_age:seconds / last_success_age:baseline6h: the ratio the dashboard colours on |
| genai-otel-bridge:upstream_error_ratio:5m | Error ratio per upstream target; drives the upstream-health panel and the GenaiOtelBridgeUpstreamErrorBudget alert |
| genai-otel-bridge:window_truncated:rate5m | Rate of window-truncation events per loop; drives GenaiOtelBridgeWindowTruncatedDroppingRecords |
| genai-otel-bridge:scrape_healthy | 1 when the leader's last successful collect+emit is recent |
| genai-otel-bridge:scrape_present | 1 if last_success_timestamp_seconds was exported in the last 15m |
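
To make the first three rules concrete, here is a rough Python sketch of the arithmetic they describe: staleness per loop, a p90 baseline over a trailing window, and their ratio. This is not the rule definitions themselves. The loop names and sample values are hypothetical, and the quantile uses linear interpolation between ranks, which is an assumption about how the baseline is evaluated.

```python
from typing import Dict, Iterable, Sequence, Tuple


def last_success_age(now: float, samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Per loop: now minus the newest last-success timestamp seen for that loop."""
    latest: Dict[str, float] = {}
    for loop, ts in samples:
        latest[loop] = max(ts, latest.get(loop, ts))
    return {loop: now - ts for loop, ts in latest.items()}


def quantile(q: float, values: Sequence[float]) -> float:
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples in window")
    rank = q * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Hypothetical samples: (loop, last_success_timestamp_seconds) from two replicas.
samples = [("log_export", 1000.0), ("log_export", 1600.0), ("snapshot", 1940.0)]
ages = last_success_age(now=2000.0, samples=samples)

# Hypothetical trailing window of past staleness values for the log_export loop.
history = [10.0 * i for i in range(1, 12)]  # 10, 20, ..., 110
baseline = quantile(0.9, history)
ratio = ages["log_export"] / baseline

print(ages, baseline, ratio)
```

Taking the maximum timestamp per loop means a stale replica cannot make a loop look older than its most recent successful emit.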

Product recording rules

The product/ directory ships recording rules over portkey_api_* and langsmith_* signals. These encode the correct query patterns for per-bucket gauges (using sum_over_time not rate/increase).

| Rule | What it computes |
| --- | --- |
| portkey:requests:sum_5m | sum_over_time(portkey_api_requests[5m]): total requests over 5m windows |
| portkey:error_ratio:5m | Error ratio derived from portkey_api_errors and portkey_api_requests |
| langsmith:runs:sum_5m | Total runs across all sessions summed per environment |
| langsmith:cost_usd:sum_5m | Total cost across sessions |
| langsmith:tokens:sum_5m | Total tokens across sessions |

Per-bucket gauge semantics

portkey_api_* metrics are per-bucket gauges, not counters. An instant query between emit cycles may read as absent — use last_over_time(...[20m]) to see the last known value. Always use sum_over_time to aggregate; never use rate() or increase() on these metrics. genai_otel_bridge_source_graph_unavailable_total and genai_otel_bridge_upstream_request_duration_seconds are counters — rate() is correct there.
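
The difference can be seen in a small Python simulation. It is a deliberately simplified model, not how PromQL is implemented: the counter-style function only sums positive deltas and treats any drop as a counter reset, with no extrapolation. The sample values are made up.

```python
from typing import Optional, Sequence


def sum_over_time(samples: Sequence[float]) -> float:
    """Add up every per-bucket value in the window."""
    return float(sum(samples))


def counter_style_increase(samples: Sequence[float]) -> float:
    """Simplified counter logic: sum the deltas, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def last_over_time(samples: Sequence[Optional[float]]) -> Optional[float]:
    """Most recent value that is actually present in the window, if any."""
    for value in reversed(samples):
        if value is not None:
            return value
    return None


# Hypothetical per-bucket request counts: each sample is one bucket's total.
buckets = [12.0, 7.0, 0.0, 5.0]

print(sum_over_time(buckets))           # true total across the buckets
print(counter_style_increase(buckets))  # counter logic misreads the gauge

# Between emit cycles the newest points may be missing.
print(last_over_time([12.0, None, None]))
```

Here the buckets sum to 24 requests, while the counter-style calculation reports 12 because it drops the first bucket and interprets every decrease as a reset.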


Grafana-staff prerequisites

Before deploying to production:

  • GS2 — raise Mimir out_of_order_time_window and reject_old_samples_max_age to match your tolerated downtime SLA, or long-outage backfill will be rejected.
  • GS3 — exempt genai_otel_bridge_* from Adaptive Metrics aggregation, or the staleness and error signals that detect poller failure will be rolled up and the health rules will become unreliable.

See also