Troubleshooting
Common operational issues and their fixes. Start with sf2loki doctor
for anything that smells like an auth, permissions, or connectivity problem — it isolates which
layer is failing before you go digging.
Why do I see periodic Pub/Sub reconnects every N minutes?
A sawtooth on sf2loki_pubsub_reconnects / sf2loki_auth_refreshes at a regular interval is
almost always your org's session timeout, not a fault.
This is expected — tune the session timeout, don't chase it as a bug
Neither Salesforce OAuth flow (jwt_bearer or client_credentials) returns expires_in or a
refresh token — the access token's real lifetime is the org's session timeout (Setup →
Session Settings), which can be as short as 15 minutes. sf2loki handles expiry reactively: it
re-mints the token on a 401/UNAUTHENTICATED and resubscribes from the stored replay_id, so
there's no data loss, just reconnect churn. To reduce the churn, raise the integration user's
session timeout (a profile-level Session Settings override works) and set
salesforce.token_ttl to match, so sf2loki proactively re-mints before Salesforce kills the
session.
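The interplay between reactive and proactive re-minting can be sketched as a small token cache. This is a minimal illustration, not sf2loki's actual implementation: the `TokenCache` class, its `mint` callable, and the `ttl_seconds` argument (standing in for salesforce.token_ttl) are all hypothetical names.

```python
import time


class TokenCache:
    """Hypothetical token cache: proactive re-mint on a TTL, reactive re-mint on demand."""

    def __init__(self, mint, ttl_seconds=None, clock=time.monotonic):
        self._mint = mint        # callable that runs the OAuth flow and returns a token
        self._ttl = ttl_seconds  # None means "no proactive expiry, reactive only"
        self._clock = clock      # injectable so the behaviour can be tested
        self._token = None
        self._minted_at = 0.0
        self.refreshes = 0       # counts re-mints, similar in spirit to an auth-refresh metric

    def get(self):
        """Return a token, minting a new one if none is cached or the TTL has elapsed."""
        age = self._clock() - self._minted_at
        if self._token is None or (self._ttl is not None and age >= self._ttl):
            self._token = self._mint()
            self._minted_at = self._clock()
            self.refreshes += 1
        return self._token

    def invalidate(self):
        """Reactive path: call after a 401/UNAUTHENTICATED so the next get() re-mints."""
        self._token = None
```

With no TTL, the only trigger is `invalidate()`, which corresponds to the reconnect churn described above. With a TTL no longer than the org's session timeout, `get()` replaces the token before the server would reject it.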
The container crash-loops with a permission error on startup
This is almost always a secret file uid mismatch.
Secret files must be readable by uid 10001
The container runs as a non-root user, uid 10001. Files under salesforce.private_key_file,
salesforce.client_secret_file, sink.loki.auth_token_file, and similar *_file paths must
be readable by that uid or the service fails fast at startup with an actionable "permission
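denied" error. A root-owned key file with mode 0600, the natural way to store a private key,
is exactly the trap. Fix it by making each file readable by uid 10001, for example by running
chown 10001 on it and keeping the restrictive mode.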
The same applies to the checkpoint state directory, but in the other direction — it must be
writable by uid 10001: mkdir -p state && chmod 770 state && chown 10001 state (not a
permissive 777).
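To check a secret file or the state directory before starting the container, the classic POSIX owner/group/other rule can be sketched as below. This is an illustrative pre-flight check, not part of sf2loki: the function names are hypothetical, the supplementary group ids of the container user are an assumption the caller must supply, and ACLs and the root override are deliberately ignored.

```python
import os

CONTAINER_UID = 10001  # uid the container runs as, per the text above


def _class_bits(mode, owner_uid, owner_gid, uid, gids):
    """Return the rwx triplet (0-7) that applies to uid: owner, then group, then other."""
    if uid == owner_uid:
        return (mode >> 6) & 0o7
    if owner_gid in gids:
        return (mode >> 3) & 0o7
    return mode & 0o7


def can_read_file(mode, owner_uid, owner_gid, uid, gids=frozenset()):
    """True if uid has the read bit on a file with this mode and ownership."""
    return bool(_class_bits(mode, owner_uid, owner_gid, uid, gids) & 0o4)


def can_write_dir(mode, owner_uid, owner_gid, uid, gids=frozenset()):
    """True if uid has both write and search (execute) bits on a directory."""
    return _class_bits(mode, owner_uid, owner_gid, uid, gids) & 0o3 == 0o3


def file_readable_in_container(path, gids=frozenset()):
    """Stat a real path and apply the read check for the container uid."""
    st = os.stat(path)
    return can_read_file(st.st_mode, st.st_uid, st.st_gid, CONTAINER_UID, gids)
```

A root-owned 0600 file fails `can_read_file` for uid 10001, while the same file owned by 10001 passes. A state directory with mode 770 owned by 10001 passes `can_write_dir`.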
My load balancer / orchestrator keeps restarting a healthy standby
You've pointed a liveness/restart check at /readyz instead of /healthz.
/readyz is readiness, not liveness — never wire it to a restart policy
/healthz is liveness: 200 whenever the process is up, even mid-startup or while standing by
as an HA standby. /readyz is readiness: 200 only once auth has resolved and the pipeline is
actively running, and it degrades to 503 if Loki pushes have failed continuously past
service.unready_after_sink_failing. On an active-passive HA pair the standby's /readyz stays
503 by design for as long as the active instance is healthy; the standby only becomes the
leader when the active instance fails. A
Kubernetes livenessProbe, an ECS task-level healthCheck, or a Docker HEALTHCHECK pointed
at /readyz restart-loops the standby continuously and defeats failover. Use /readyz only
for routing decisions (a Kubernetes readinessProbe, an ECS target-group health check) and
/healthz for anything that can restart the process. See
High Availability for the full readiness-vs-liveness split.
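The difference between the two endpoints can be modelled as two small functions over the process state. This is a sketch of the semantics described above, not sf2loki's handler code: the `State` fields are hypothetical names, and treating the service.unready_after_sink_failing threshold as seconds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class State:
    auth_resolved: bool                # has the OAuth flow succeeded yet?
    pipeline_running: bool             # False while standing by as an HA standby
    sink_failing_for: float            # seconds of continuous Loki push failure (assumed unit)
    unready_after_sink_failing: float  # threshold from service.unready_after_sink_failing


def healthz(state):
    """Liveness: 200 whenever the process is up enough to answer at all."""
    return 200


def readyz(state):
    """Readiness: 200 only when auth is resolved, the pipeline runs, and the sink is healthy."""
    if not (state.auth_resolved and state.pipeline_running):
        return 503
    if state.sink_failing_for > state.unready_after_sink_failing:
        return 503
    return 200
```

A healthy standby is `State(auth_resolved=True, pipeline_running=False, ...)`: it answers 200 on `healthz` and 503 on `readyz`, which is why a restart policy keyed on readiness would kill it repeatedly.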
A dashboard panel is empty even though sf2loki is running
This is almost always the OpenTelemetry→Prometheus metric-name suffix, not a broken connector.
Check add_metric_suffixes before assuming metrics aren't flowing
Instruments are created unsuffixed in code (e.g. sf2loki_events_ingested), but appear in
Prometheus/Grafana with the standard OTel→Prometheus suffixes — _total, _bucket, _count,
_sum (e.g. sf2loki_events_ingested_total). Grafana Cloud's OTLP endpoint adds these by
default. If you route metrics through your own OpenTelemetry Collector or Grafana Alloy
instead, keep add_metric_suffixes (a.k.a. AddMetricSuffixes) enabled on the Prometheus
exporter — with it off, every panel and alert rule that queries the suffixed name goes
silently blank. See Metrics Reference for the full instrument
list and which name each expects.
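When a panel is blank, it can help to work out which series names a given instrument should produce once the suffixes are applied. The helper below is a rough sketch of that mapping for the suffixes named above; `prometheus_names` is a hypothetical function, and it ignores unit suffixes, which an exporter may also append.

```python
from typing import List


def prometheus_names(instrument: str, kind: str) -> List[str]:
    """Series names expected in Prometheus for an unsuffixed instrument name.

    kind is "counter", "histogram", or "gauge". Unit suffixes are not modelled.
    """
    if kind == "counter":
        return [instrument + "_total"]
    if kind == "histogram":
        return [instrument + "_bucket", instrument + "_count", instrument + "_sum"]
    if kind == "gauge":
        return [instrument]  # gauges keep their bare name
    raise ValueError("unknown instrument kind: %r" % kind)
```

For example, `prometheus_names("sf2loki_events_ingested", "counter")` gives the `sf2loki_events_ingested_total` name that a dashboard query would need to match.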
sf2loki doctor reports a short EventLogFile menu / no Hourly files
Expected on an org without the Shield/Event Monitoring add-on.
Free and dev orgs get a fixed EventLogFile subset
Without the Shield Event Monitoring add-on, an org produces only the free EventLogFile
subset — Login, Logout, API Total Usage, Apex Unexpected Exception, and the CORS/CSP-violation
and hostname-redirect types — at Daily interval only, 1-day retention. An
event_types: ["*"] wildcard on such an org silently yields just those types; that's
discovery working correctly, not a bug. interval: Hourly additionally needs the add-on's
hourly opt-in — expect doctor's entitlement check to WARN (not FAIL) when a configured
type or interval isn't available yet, since it may simply not have produced a file recently.
The full ~70-type catalogue and the Real-Time Event Monitoring (RTEM) streaming channels
require the add-on.
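The wildcard and warning behaviour described above can be pictured as a filter over whatever types the org actually exposes. The sketch below is illustrative only: `resolve_event_types` is a hypothetical name, the set of available types is supplied by the caller, and the warning text is invented for the example rather than copied from doctor's output.

```python
def resolve_event_types(configured, available):
    """Return (selected, warnings) for a configured event_types list.

    A "*" entry selects every available type. Explicitly named types that the
    org does not currently expose produce a warning instead of a failure.
    """
    if "*" in configured:
        return sorted(available), []
    selected = [t for t in configured if t in available]
    warnings = [
        "WARN: %s is configured but not currently available" % t
        for t in configured
        if t not in available
    ]
    return selected, warnings
```

On an org that exposes only Login and Logout, a `["*"]` configuration resolves to exactly those two types with no warnings, which matches the "silently yields just those types" behaviour.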
sf2loki refuses to start with an OverlapError
You've enabled the same event category on more than one source.
Fix the overlap, or opt into it deliberately
Salesforce exposes the same underlying activity through multiple channels — for example
/event/LoginEventStream (Pub/Sub), LoginEvent (SOQL-polled), and Login (EventLogFile) are
the same records in three costumes. Ingesting one event category from more than one source
double-counts it in Loki, so sf2loki's startup overlap guard (src/sf2loki/sources/overlap.py)
refuses to start and lists every colliding category. Either disable all but one source for the
affected category, or set sources.allow_overlap: true if the duplication is deliberate (for
example, relying on Loki to drop byte-identical entries, or intentionally running both a lean
real-time stream and a richer batch source side by side).
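The idea behind the guard can be sketched in a few lines. This is not the code in src/sf2loki/sources/overlap.py: the source names, the category labels, and the function signatures below are hypothetical, and the real guard may map channels to categories differently.

```python
from collections import defaultdict


class OverlapError(Exception):
    """Raised when one event category is enabled on more than one source."""


def find_overlaps(enabled):
    """Map each category to the sources that enable it, keeping only collisions.

    enabled is a dict of source name -> iterable of category names.
    """
    by_category = defaultdict(list)
    for source, categories in enabled.items():
        for category in set(categories):  # ignore duplicates within one source
            by_category[category].append(source)
    return {c: sorted(s) for c, s in by_category.items() if len(s) > 1}


def check_overlap(enabled, allow_overlap=False):
    """Refuse to proceed on any collision unless overlap was explicitly allowed."""
    overlaps = find_overlaps(enabled)
    if overlaps and not allow_overlap:
        details = "; ".join(
            "%s: %s" % (c, ", ".join(s)) for c, s in sorted(overlaps.items())
        )
        raise OverlapError("event categories enabled on more than one source: " + details)
```

Listing every colliding category in one error, rather than stopping at the first, lets an operator fix the whole configuration in a single pass.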