Troubleshooting¶
This page covers the most common tailscale2otel problems, their root causes, and concrete fixes.
Config keys are written as full key paths; see Configuration for defaults and
env-var equivalents.
Authentication failures¶
API key stopped working¶
Cause. A personal API key (tailscale.auth.method: apikey) expires in at most 90 days and is
bound to the creating user. If that user is suspended or removed from the tailnet, the key is
immediately revoked.
Fix. Switch to OAuth, which issues short-lived, auto-refreshing tokens that are not tied to any user and never expire on a fixed schedule:
tailscale:
  auth:
    method: oauth
    oauth:
      client_id: ""      # set via TS2OTEL_TAILSCALE__AUTH__OAUTH__CLIENT_ID
      client_secret: ""  # set via TS2OTEL_TAILSCALE__AUTH__OAUTH__CLIENT_SECRET
      scopes:
        - all:read
If you must keep method: apikey, the startup log will always contain a WARN advisory — that is
expected and intentional.
401 responses logged at ERROR with OAuth¶
Cause. A 401 returned while OAuth is active (or an OAuth token-exchange failure with 401/403) is logged at ERROR by the API transport. This means the OAuth client credentials are wrong, the client has been deleted, or it lacks the required scopes.
Fix. Verify the client_id and client_secret match an active Tailscale OAuth client, and
check that the client carries at least the all:read scope. If you use
streaming.auto_configure, the log_streaming scope is also required:
tailscale:
  auth:
    oauth:
      scopes:
        - all:read
        - log_streaming  # only needed for streaming.auto_configure
Tip
Non-401 4xx responses (e.g. 403 from the flowlogs endpoint on an idle tailnet) are not logged as errors by the transport — they surface only as a collector WARN "collector failed" to avoid per-tick spam.
No data arriving¶
Bare gateway URL returns 404 silently¶
Cause. When otlp.protocol is http, tailscale2otel appends the per-signal paths for you and calls
<endpoint>/v1/metrics and <endpoint>/v1/logs. If you set otlp.endpoint to a bare gateway URL that
does not end with /otlp (e.g. https://otlp-gateway-prod-us-central-0.grafana.net instead of
…/otlp), requests go to <host>/v1/metrics rather than <host>/otlp/v1/metrics, and the gateway
returns 404. Because the exporter sees the 404 as a successful HTTP exchange, it may not raise an
obvious error.
Fix. Set otlp.endpoint to the base URL ending in /otlp:
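A minimal sketch of the corrected setting, assuming the Grafana Cloud gateway host used as the example above (substitute your own gateway host):

```yaml
otlp:
  protocol: http
  # Base URL must end in /otlp; /v1/metrics and /v1/logs are appended by the exporter.
  endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp
```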
The per-signal suffixes (/v1/metrics, /v1/logs) are appended automatically. See
Configuration for the Grafana Cloud default.
Wrong otlp.protocol¶
Cause. Setting otlp.protocol: stdout prints all signals to the console instead of sending
them to a backend. This is correct for local debugging but will leave your metrics store empty.
Fix. Set the protocol to match your backend transport:
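As a sketch, assuming an OTLP/HTTP backend (http is the only non-stdout value named on this page; see Configuration for the full list of accepted values):

```yaml
otlp:
  protocol: http  # was: stdout, which only prints signals to the console
```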
Tip
protocol: stdout is deliberate for local debugging without a backend — run with it to
confirm signals are emitted before pointing at a real endpoint.
Flow-log / audit-log double-counting¶
Cause. flowlogs and auditlogs each have a source field that controls whether records come
from the API poller, the Splunk-HEC stream receiver, or both. Setting source: both — or running
the streaming receiver while a collector still polls the same log type — feeds the same records
through the same processor twice. Cross-source de-duplication is a best-effort failsafe and does not
guarantee exactly-once delivery. The exporter logs a startup WARN when this condition is detected.
Fix. Pick exactly one ingestion path per log type:
collectors:
  flowlogs:
    source: poll  # or stream, not both
  auditlogs:
    source: poll  # or stream, not both
See Streaming & Webhooks for when to prefer stream over poll.
Flow/audit enrichment shows unknown or external¶
Cause. IP-to-device-name resolution for flow logs and audit records depends on the in-memory
device-enrichment cache, which is populated by the devices collector. If devices is disabled,
no cache is ever built and every address falls back to unknown (tailnet nodes) or external
(off-tailnet addresses).
Fix. Ensure the devices collector is enabled (it is on by default):
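A minimal sketch, assuming the collector is toggled by an enabled key under collectors.devices. The key name is an assumption made for illustration; check Configuration for the exact spelling:

```yaml
collectors:
  devices:
    enabled: true  # hypothetical key name; the devices collector is on by default
```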
The tailscale2otel.enrich.cache_size gauge (→ tailscale2otel_enrich_cache_size_ratio) shows how
many devices are currently in the cache; tailscale2otel.enrich.cache_age (→
tailscale2otel_enrich_cache_age_seconds) shows how stale it is.
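These two gauges can also drive alerts. The following is a sketch of Prometheus-style alerting rules, assuming your metrics backend evaluates such rules. The group name, alert names, the 900-second threshold, and the for durations are illustrative choices, not project defaults:

```yaml
groups:
  - name: tailscale2otel-enrichment  # hypothetical group name
    rules:
      # Fires when the device-enrichment cache has not been refreshed recently.
      - alert: Tailscale2otelEnrichCacheStale
        expr: tailscale2otel_enrich_cache_age_seconds > 900  # illustrative threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Device-enrichment cache is stale; check the devices collector"
      # Fires when the cache holds no devices, so every address resolves to unknown/external.
      - alert: Tailscale2otelEnrichCacheEmpty
        expr: tailscale2otel_enrich_cache_size_ratio == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Device-enrichment cache is empty"
```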
Cardinality overflow — series silently dropped¶
Cause. Every metric instrument is bounded by cardinality.metric_limit (default 10000).
When the number of distinct active series for a single instrument reaches this cap, the OpenTelemetry SDK
collapses all further series into a single {otel_metric_overflow="true"} series. Per-series detail
is silently lost; only the overflow sentinel remains. The most common trigger is enabling per-port
dimensions (cardinality.flow.source_port or cardinality.flow.destination_port) on a busy tailnet.
Diagnosis. Watch these self-observability signals:
- tailscale2otel_series_overflowing_ratio{metric_name="..."}: 1 when the named metric hit the cap during the last export interval.
- tailscale2otel_series_active{metric_name="..."}: the active series count, which pins at the cap when exceeded.
- A series with label otel_metric_overflow="true" appearing in your metrics store (e.g. tailscale_network_io_bytes_total{otel_metric_overflow="true"}) is the direct indicator.
- tailscale2otel_series_limit shows the configured cap (emitted only when a positive limit is set).
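The overflow signal lends itself to an alert. Below is a sketch of a Prometheus-style alerting rule, assuming your backend evaluates such rules. The group name, alert name, severity, and for duration are illustrative, not shipped with tailscale2otel:

```yaml
groups:
  - name: tailscale2otel-cardinality  # hypothetical group name
    rules:
      # Fires when any instrument reports that it hit cardinality.metric_limit.
      - alert: Tailscale2otelSeriesOverflow
        expr: tailscale2otel_series_overflowing_ratio == 1
        for: 5m  # illustrative; tune to your export interval
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.metric_name }} hit the per-instrument series cap"
```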
Fix. Either raise the cap or reduce cardinality:
cardinality:
  metric_limit: 50000        # raise the per-instrument series cap
  flow:
    source_port: false       # disable per-port dimensions (largest driver)
    destination_port: false
    metrics_mode: rollup     # use bounded top-N rollup instead of per-connection raw families
    rollup_top_n: 500        # keep only the busiest N src/dst pairs
Setting cardinality.metric_limit: 0 removes the cap entirely, at the cost of unbounded memory
growth under high-cardinality conditions.
Node-metrics label collision (tailscale_node vs. instance)¶
Cause. The node-metrics scraper adds a tailscale_node label to every forwarded tailscaled
series to identify which node the series came from. Deliberately, it does not use instance:
on Grafana Cloud, the OTLP-to-Prometheus translation promotes the exporter's own
service.instance.id resource attribute to the instance label. If the per-node label were also
called instance, the collector-host value would overwrite it and collapse every scraped node's
series onto the same instance, making per-node queries impossible.
If you see tailscale_node_up_ratio missing from your store, or all forwarded tailscaled_*
series sharing the same instance label value rather than being distinguished by node name, check
that your dashboards or recording rules query on tailscale_node, not instance.
Fix. No configuration change is required — the label is tailscale_node by design. Update any
dashboard queries or alert rules that reference instance for these series to use tailscale_node
instead.
Tip
The tailscale.node.up gauge (→ tailscale_node_up_ratio) is the canonical per-node health
signal. It carries the tailscale_node label and is always emitted regardless of
metric_allow/metric_deny filters. Use it for scrape-health alerting.
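A sketch of scrape-health alerting on this gauge, written as a Prometheus-style rule. It assumes the gauge reads 1 for a healthy scrape and 0 for a failed one, which this page does not state; the group name, alert name, and for duration are likewise illustrative:

```yaml
groups:
  - name: tailscale2otel-node-health  # hypothetical group name
    rules:
      - alert: TailscaleNodeScrapeDown
        # Assumption: 0 means the node-metrics scrape failed for that node.
        expr: tailscale_node_up_ratio == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          # Keyed on tailscale_node, not instance, per the label design above.
          summary: "Node-metrics scrape failing for {{ $labels.tailscale_node }}"
```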