Skip to content

Troubleshooting

Common operational issues and their fixes. Start with sf2loki doctor for anything that smells like an auth, permissions, or connectivity problem — it isolates which layer is failing before you go digging.

Why do I see periodic Pub/Sub reconnects every N minutes?

A sawtooth on sf2loki_pubsub_reconnects / sf2loki_auth_refreshes at a regular interval is almost always your org's session timeout, not a fault.

This is expected — tune the session timeout, don't chase it as a bug

Neither Salesforce OAuth flow (jwt_bearer or client_credentials) returns expires_in or a refresh token — the access token's real lifetime is the org's session timeout (Setup → Session Settings), which can be as short as 15 minutes. sf2loki handles expiry reactively: it re-mints the token on a 401/UNAUTHENTICATED and resubscribes from the stored replay_id, so there's no data loss, just reconnect churn. To reduce the churn, raise the integration user's session timeout (a profile-level Session Settings override works) and set salesforce.token_ttl to match, so sf2loki proactively re-mints before Salesforce kills the session.

The container crash-loops with a permission error on startup

This is almost always a secret file uid mismatch.

Secret files must be readable by uid 10001

The container runs as a non-root user, uid 10001. Files under salesforce.private_key_file, salesforce.client_secret_file, sink.loki.auth_token_file, and similar *_file paths must be readable by that uid or the service fails fast at startup with an actionable "permission denied" error. A root-owned chmod 0600 key file — the natural way to store a private key — is exactly the trap. Fix with:

chmod 640 secrets/*        # or: chown the files to uid 10001

The same applies to the checkpoint state directory, but in the other direction — it must be writable by uid 10001: mkdir -p state && chmod 770 state && chown 10001 state (not a permissive 777).

My load balancer / orchestrator keeps restarting a healthy standby

You've pointed a liveness/restart check at /readyz instead of /healthz.

/readyz is readiness, not liveness — never wire it to a restart policy

/healthz is liveness: 200 whenever the process is up, even mid-startup or while standing by as an HA standby. /readyz is readiness: 200 only once auth has resolved and the pipeline is actively running, and it degrades to 503 if Loki pushes have failed continuously past service.unready_after_sink_failing. On an active-passive HA pair the standby's /readyz is 503 forever, by design — it never becomes the leader until the active instance fails. A Kubernetes livenessProbe, an ECS task-level healthCheck, or a Docker HEALTHCHECK pointed at /readyz restart-loops the standby continuously and defeats failover. Use /readyz only for routing decisions (a Kubernetes readinessProbe, an ECS target-group health check) and /healthz for anything that can restart the process. See High Availability for the full readiness-vs-liveness split.

A dashboard panel is empty even though sf2loki is running

This is almost always the OpenTelemetry→Prometheus metric-name suffix, not a broken connector.

Check add_metric_suffixes before assuming metrics aren't flowing

Instruments are created unsuffixed in code (e.g. sf2loki_events_ingested), but appear in Prometheus/Grafana with the standard OTel→Prometheus suffixes — _total, _bucket, _count, _sum (e.g. sf2loki_events_ingested_total). Grafana Cloud's OTLP endpoint adds these by default. If you route metrics through your own OpenTelemetry Collector or Grafana Alloy instead, keep add_metric_suffixes (a.k.a. AddMetricSuffixes) enabled on the Prometheus exporter — with it off, every panel and alert rule that queries the suffixed name goes silently blank. See Metrics Reference for the full instrument list and which name each expects.

sf2loki doctor reports a short EventLogFile menu / no Hourly files

Expected on an org without the Shield/Event Monitoring add-on.

Free and dev orgs get a fixed EventLogFile subset

Without the Shield Event Monitoring add-on, an org produces only the free EventLogFile subset — Login, Logout, API Total Usage, Apex Unexpected Exception, and the CORS/CSP-violation and hostname-redirect types — at Daily interval only, 1-day retention. An event_types: ["*"] wildcard on such an org silently yields just those types; that's discovery working correctly, not a bug. interval: Hourly additionally needs the add-on's hourly opt-in — expect doctor's entitlement check to WARN (not FAIL) when a configured type or interval isn't available yet, since it may simply not have produced a file recently. The full ~70-type catalogue and RTEM streaming channels require the add-on.

sf2loki refuses to start with an OverlapError

You've enabled the same event category on more than one source.

Fix the overlap, or opt into it deliberately

Salesforce exposes the same underlying activity through multiple channels — for example /event/LoginEventStream (Pub/Sub), LoginEvent (SOQL-polled), and Login (EventLogFile) are the same records in three costumes. Ingesting one event category from more than one source double-counts it in Loki, so sf2loki's startup overlap guard (src/sf2loki/sources/overlap.py) refuses to start and lists every colliding category. Either disable all but one source for the affected category, or set sources.allow_overlap: true if the duplication is deliberate (for example, relying on Loki to drop byte-identical entries, or intentionally running both a lean real-time stream and a richer batch source side by side).