Troubleshooting

This page covers the most common tailscale2otel problems, their root causes, and concrete fixes. Config keys are written as full key paths; see Configuration for defaults and env-var equivalents.


Authentication failures

API key stopped working

Cause. A personal API key (tailscale.auth.method: apikey) expires in at most 90 days and is bound to the creating user. If that user is suspended or removed from the tailnet, the key is immediately revoked.

Fix. Switch to OAuth. An OAuth client is not tied to any user and does not expire on a fixed schedule; it issues short-lived tokens that are refreshed automatically:

tailscale:
  auth:
    method: oauth
    oauth:
      client_id: ""      # set via TS2OTEL_TAILSCALE__AUTH__OAUTH__CLIENT_ID
      client_secret: ""  # set via TS2OTEL_TAILSCALE__AUTH__OAUTH__CLIENT_SECRET
      scopes:
        - all:read

If you must keep method: apikey, the startup log will always contain a WARN advisory — that is expected and intentional.

401 responses logged at ERROR with OAuth

Cause. A 401 returned while OAuth is active (or an OAuth token-exchange failure with 401/403) is logged at ERROR by the API transport. This means the OAuth client credentials are wrong, the client has been deleted, or it lacks the required scopes.

Fix. Verify the client_id and client_secret match an active Tailscale OAuth client, and check that the client carries at least the all:read scope. If you use streaming.auto_configure, the log_streaming scope is also required:

tailscale:
  auth:
    oauth:
      scopes:
        - all:read
        - log_streaming   # only needed for streaming.auto_configure
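
The scope rules above can be written down as a small pre-flight check. This is a minimal sketch and not part of tailscale2otel: the helper name `missing_scopes` and its arguments are hypothetical, and it encodes only the two rules stated in this section.

```python
def missing_scopes(granted, auto_configure=False):
    """Return the scopes this page lists as required that are absent from `granted`.

    Hypothetical helper: `granted` is the list of scopes on your OAuth client,
    and `auto_configure` mirrors the streaming.auto_configure setting.
    """
    required = ["all:read"]
    if auto_configure:
        # log_streaming is only needed when streaming.auto_configure is on.
        required.append("log_streaming")
    return [scope for scope in required if scope not in granted]


print(missing_scopes(["all:read"], auto_configure=True))  # prints ['log_streaming']
```

An empty result means the client carries every scope named here; it does not prove that the credentials themselves are valid.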

Tip

Non-401 4xx responses (e.g. 403 from the flowlogs endpoint on an idle tailnet) are not logged as errors by the transport — they surface only as a collector WARN "collector failed" to avoid per-tick spam.


No data arriving

Bare gateway URL returns 404 silently

Cause. When otlp.protocol is http, tailscale2otel appends the per-signal paths for you and calls <endpoint>/v1/metrics and <endpoint>/v1/logs. If you set otlp.endpoint to a bare gateway URL that does not end with /otlp (e.g. https://otlp-gateway-prod-us-central-0.grafana.net instead of …/otlp), those paths land at the wrong base and the gateway returns 404. Because the exporter treats the 404 as a completed HTTP exchange, it may not raise an obvious error.

Fix. Set otlp.endpoint to the base URL ending in /otlp:

otlp:
  endpoint: https://otlp-gateway-prod-us-central-0.grafana.net/otlp

The per-signal suffixes (/v1/metrics, /v1/logs) are appended automatically. See Configuration for the Grafana Cloud default.
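
To see why the bare URL fails, it can help to print the URLs that result from appending the per-signal paths. The sketch below is illustrative only: `signal_urls` and `ends_with_otlp` are hypothetical helpers, and the plain string concatenation (including stripping a trailing slash) is an assumption about how the paths are joined, not the exporter's actual code.

```python
def signal_urls(endpoint):
    """Build per-signal URLs as <endpoint>/v1/metrics and <endpoint>/v1/logs."""
    base = endpoint.rstrip("/")  # assumption: a trailing slash is tolerated
    return {"metrics": base + "/v1/metrics", "logs": base + "/v1/logs"}


def ends_with_otlp(endpoint):
    """True when the base URL ends in /otlp, as the fix above requires."""
    return endpoint.rstrip("/").endswith("/otlp")


bare = "https://otlp-gateway-prod-us-central-0.grafana.net"
print(signal_urls(bare)["metrics"])
# prints https://otlp-gateway-prod-us-central-0.grafana.net/v1/metrics
print(ends_with_otlp(bare))            # prints False
print(ends_with_otlp(bare + "/otlp"))  # prints True
```

The `/otlp` check only makes sense for gateways that serve OTLP under that path, such as the Grafana Cloud URL used in this section; other backends may expect a different base.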

Wrong otlp.protocol

Cause. Setting otlp.protocol: stdout prints all signals to the console instead of sending them to a backend. This is correct for local debugging but will leave your metrics store empty.

Fix. Set the protocol to match your backend transport:

otlp:
  protocol: http   # or grpc

Tip

protocol: stdout is deliberate for local debugging without a backend — run with it to confirm signals are emitted before pointing at a real endpoint.


Flow-log / audit-log double-counting

Cause. flowlogs and auditlogs each have a source field that controls whether records come from the API poller, the Splunk-HEC stream receiver, or both. Setting source: both — or running the streaming receiver while a collector still polls the same log type — feeds the same records through the same processor twice. Cross-source de-duplication is a best-effort failsafe and does not guarantee exact-once delivery. The exporter logs a startup WARN when this condition is detected.

Fix. Pick exactly one ingestion path per log type:

collectors:
  flowlogs:
    source: poll     # or stream — not both
  auditlogs:
    source: poll     # or stream — not both

See Streaming & Webhooks for when to prefer stream over poll.
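
A quick way to catch the `source: both` case before deploying is to scan the loaded configuration. This is a minimal sketch, assuming the config file has already been parsed into a Python dict; `double_ingest_warnings` is a hypothetical helper and is unrelated to the exporter's own startup WARN.

```python
def double_ingest_warnings(config):
    """Return one message per log type whose source is set to 'both'."""
    collectors = config.get("collectors", {})
    warnings = []
    for log_type in ("flowlogs", "auditlogs"):
        source = collectors.get(log_type, {}).get("source")
        if source == "both":
            warnings.append(
                f"collectors.{log_type}.source is 'both'; pick poll or stream"
            )
    return warnings


config = {
    "collectors": {
        "flowlogs": {"source": "both"},
        "auditlogs": {"source": "poll"},
    }
}
for message in double_ingest_warnings(config):
    print(message)
# prints collectors.flowlogs.source is 'both'; pick poll or stream
```

The check covers only the explicit `both` setting. It cannot detect a streaming receiver that runs alongside a polling collector for the same log type, so the startup log remains the authoritative signal.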


Flow/audit enrichment shows unknown or external

Cause. IP-to-device-name resolution for flow logs and audit records depends on the in-memory device-enrichment cache, which is populated by the devices collector. If devices is disabled, no cache is ever built and every address falls back to unknown (tailnet nodes) or external (off-tailnet addresses).

Fix. Ensure the devices collector is enabled (it is on by default):

collectors:
  devices:
    enabled: true

The tailscale2otel.enrich.cache_size gauge (→ tailscale2otel_enrich_cache_size_ratio) shows how many devices are currently in the cache; tailscale2otel.enrich.cache_age (→ tailscale2otel_enrich_cache_age_seconds) shows how stale it is.
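
The fallback behaviour described above can be modelled in a few lines. This is a toy sketch, not the exporter's implementation: `resolve_name` is hypothetical, the cache is a plain dict from IP to device name, and classifying an address as a tailnet node by Tailscale's standard address ranges is an assumption made for illustration.

```python
import ipaddress

# Tailscale assigns node addresses from these ranges. Using them to choose
# between "unknown" and "external" is an assumption of this sketch.
TAILNET_RANGES = [
    ipaddress.ip_network("100.64.0.0/10"),
    ipaddress.ip_network("fd7a:115c:a1e0::/48"),
]


def resolve_name(ip, cache):
    """Return the cached device name, or the fallback label for a cache miss."""
    if ip in cache:
        return cache[ip]
    addr = ipaddress.ip_address(ip)
    in_tailnet = any(addr in net for net in TAILNET_RANGES)
    return "unknown" if in_tailnet else "external"


populated = {"100.64.0.5": "laptop"}
print(resolve_name("100.64.0.5", populated))  # prints laptop
print(resolve_name("100.64.0.5", {}))         # prints unknown
print(resolve_name("8.8.8.8", {}))            # prints external
```

With an empty cache, which is the situation when the devices collector is disabled, every lookup ends in one of the two fallback labels.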


Cardinality overflow — series silently dropped

Cause. Every metric instrument is bounded by cardinality.metric_limit (default 10000). When the number of distinct active series for a single instrument reaches this cap, the OpenTelemetry SDK collapses all further series into a single {otel_metric_overflow="true"} series. Per-series detail is silently lost; only the overflow sentinel remains. The most common trigger is enabling per-port dimensions (cardinality.flow.source_port or cardinality.flow.destination_port) on a busy tailnet.

Diagnosis. Watch these signals:

  • tailscale2otel_series_overflowing_ratio{metric_name="..."} is 1 when the named metric hit the cap during the last export interval.
  • tailscale2otel_series_active{metric_name="..."} — the active series count, which pins at the cap when exceeded.
  • A series with label otel_metric_overflow="true" appearing in your metrics store (e.g. tailscale_network_io_bytes_total{otel_metric_overflow="true"}) is the direct indicator.
  • tailscale2otel_series_limit shows the configured cap (emitted only when a positive limit is set).

Fix. Either raise the cap or reduce cardinality:

cardinality:
  metric_limit: 50000        # raise the per-instrument series cap

  flow:
    source_port: false        # disable per-port dimensions (largest driver)
    destination_port: false
    metrics_mode: rollup      # use bounded top-N rollup instead of per-connection raw families
    rollup_top_n: 500         # keep only the busiest N src/dst pairs

Setting cardinality.metric_limit: 0 removes the cap entirely, at the cost of unbounded memory growth under high-cardinality conditions.
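
The collapse into the overflow series can be illustrated with a toy model. The sketch below is not the SDK's implementation and does not reproduce its exact boundary accounting; `record`, `OVERFLOW`, and the `destination_port` label name are hypothetical names chosen for the example, with a limit of 0 treated as "no cap" as described above.

```python
OVERFLOW = (("otel_metric_overflow", "true"),)


def record(series, labels, value, limit):
    """Add `value` to the series for `labels`, folding new series into the
    overflow series once `limit` distinct series exist (0 means no cap)."""
    key = tuple(sorted(labels.items()))
    if key not in series and limit > 0 and len(series) >= limit:
        key = OVERFLOW
    series[key] = series.get(key, 0) + value


series = {}
for port in (443, 8080, 22, 53):
    record(series, {"destination_port": str(port)}, 100, limit=2)

print(len(series))       # prints 3 (two regular series plus the overflow series)
print(series[OVERFLOW])  # prints 200
```

In this model the total of 400 is preserved, but the contributions of ports 22 and 53 can no longer be told apart, which is the loss of detail described in this section.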


Node-metrics label collision (tailscale_node vs. instance)

Cause. The node-metrics scraper adds a tailscale_node label to every forwarded tailscaled series to identify which node the series came from. It deliberately does not use instance: on Grafana Cloud, the OTLP-to-Prometheus translation promotes the exporter's own service.instance.id resource attribute to the instance label. If the per-node label were also called instance, the two values would collide and every scraped node's series would collapse onto the same instance, making per-node queries impossible.

If you see tailscale_node_up_ratio missing from your store, or all forwarded tailscaled_* series sharing the same instance label value rather than being distinguished by node name, check that your dashboards or recording rules query on tailscale_node, not instance.

Fix. No configuration change is required — the label is tailscale_node by design. Update any dashboard queries or alert rules that reference instance for these series to use tailscale_node instead.
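
The collision that the tailscale_node label avoids can be shown with a toy model of the label promotion. This is a sketch under stated assumptions, not Grafana Cloud's translation logic: `to_prometheus_labels` is hypothetical, and it assumes that the promoted service.instance.id value wins when a series already carries an instance label.

```python
def to_prometheus_labels(resource_attrs, series_labels):
    """Merge series labels with the promoted service.instance.id attribute."""
    labels = dict(series_labels)
    if "service.instance.id" in resource_attrs:
        # Assumption: on collision the promoted resource attribute wins.
        labels["instance"] = resource_attrs["service.instance.id"]
    return labels


resource = {"service.instance.id": "collector-host"}
nodes = ["node-a", "node-b"]


def distinct_series(label_name):
    """Count distinct label sets when nodes are identified by `label_name`."""
    return {
        tuple(sorted(to_prometheus_labels(resource, {label_name: node}).items()))
        for node in nodes
    }


print(len(distinct_series("tailscale_node")))  # prints 2
print(len(distinct_series("instance")))        # prints 1
```

With tailscale_node the two nodes stay distinguishable; with instance they end up with identical label sets, which matches the symptom of every series sharing one instance value.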

Tip

The tailscale.node.up gauge (→ tailscale_node_up_ratio) is the canonical per-node health signal. It carries the tailscale_node label and is always emitted regardless of metric_allow/metric_deny filters. Use it for scrape-health alerting.