Metrics Reference

sf2loki emits its own operational metrics via OTLP. There is no Prometheus /metrics scrape endpoint: the connector is OTLP-native and pushes on an interval. This page is the canonical reference for every instrument, generated by reading src/sf2loki/obs/metrics.py.

All instruments carry an org label when multi-org ingestion is enabled (for_org() stamps it on per-org components — token provider, Salesforce clients, sources, the limits poller); single-org deployments omit it. Names below are exactly as created in code — unsuffixed. See Metric-name suffixes before wiring dashboards or alerts against them.
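
The labeling rule above can be illustrated with a minimal sketch. The helper `with_org` is hypothetical and is not the connector's `for_org()`; it only shows the described behavior, where an org label is added under multi-org ingestion and omitted otherwise. The topic and org values are made up for illustration.

```python
from typing import Dict, Mapping, Optional


def with_org(attrs: Mapping[str, str], org: Optional[str]) -> Dict[str, str]:
    """Return a copy of attrs, adding an org label only when an org is set.

    Hypothetical helper: it mirrors the documented behavior, not the real code.
    """
    merged = dict(attrs)
    if org is not None:
        merged["org"] = org
    return merged


# Single-org deployment: no org label is attached.
single = with_org({"topic": "/event/LoginEventStream"}, None)

# Multi-org deployment: the same instrument also carries org.
multi = with_org({"topic": "/event/LoginEventStream"}, "prod-emea")
```

Dashboards or alerts that must work in both modes would therefore need to treat org as an optional label.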

Enabling metrics

Set service.telemetry.enabled: true and service.telemetry.endpoint:

service:
  telemetry:
    enabled: true
    endpoint: https://otlp-gateway-<zone>.grafana.net/otlp/v1/metrics
    # endpoint: http://alloy:4318/v1/metrics   # local Alloy instead of Grafana Cloud

Grafana Cloud — basic auth defaults to the Loki sink's tenant_id / auth_token (one stack credential covers both Loki and OTLP).
Local/in-cluster Alloy — set service.telemetry.auth: none for an unauthenticated collector.
Enable Salesforce org-limit gauges (API usage, storage, streaming events) with salesforce.limits.enabled: true.
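
For orientation, the Grafana Cloud case amounts to an HTTP Basic `Authorization` header on each OTLP request. The sketch below assumes that `tenant_id` acts as the username and `auth_token` as the password; this page does not spell out that mapping, so treat it as an assumption. The function name and the sample values are illustrative only.

```python
import base64
from typing import Dict


def basic_auth_header(tenant_id: str, auth_token: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header.

    Assumption: tenant_id is the username and auth_token is the password.
    """
    credentials = f"{tenant_id}:{auth_token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + encoded}


# Placeholder credentials, not real values.
headers = basic_auth_header("123456", "example-token")
```

With `service.telemetry.auth: none`, no such header would be sent at all.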

See Configuration for the full service.telemetry field reference.

Metric-name suffixes

Keep add_metric_suffixes on, or dashboards/alerts go blank

Instruments are created unsuffixed in code (e.g. sf2loki_events_ingested), but the OpenTelemetry→Prometheus bridge appends _total to counters and _bucket/_count/_sum to histograms on export, e.g. sf2loki_events_ingested_total and sf2loki_ingest_lag_seconds_bucket. Grafana Cloud's OTLP endpoint adds these suffixes by default. If you route metrics through your own OpenTelemetry Collector or Grafana Alloy instead, keep add_metric_suffixes (a.k.a. AddMetricSuffixes) enabled on the Prometheus exporter. With it off, series are exported under their bare names, while the connector-health dashboard and every connector metric alert still query the suffixed names and return nothing.

Gauges are queried by their bare name (no suffix) in both cases.
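
The suffix rule described in this section can be written down as a small lookup. This is a sketch of the naming convention only, not the bridge's implementation; the function `exported_names` is hypothetical, and it ignores any unit suffixes an exporter might also add.

```python
from typing import List


def exported_names(name: str, kind: str) -> List[str]:
    """Map an unsuffixed instrument name to the series names to query."""
    if kind == "counter":
        return [name + "_total"]
    if kind == "histogram":
        return [name + suffix for suffix in ("_bucket", "_count", "_sum")]
    if kind == "gauge":
        # Gauges keep their bare name.
        return [name]
    raise ValueError(f"unknown instrument kind: {kind}")


counter_series = exported_names("sf2loki_events_ingested", "counter")
histogram_series = exported_names("sf2loki_ingest_lag_seconds", "histogram")
gauge_series = exported_names("sf2loki_queue_depth", "gauge")
```

A helper like this can be handy when generating dashboard queries from the instrument table instead of typing suffixed names by hand.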

Instruments

Metric name	Type	Labels	Meaning
`sf2loki_events_ingested`	Counter	(none)	Total events ingested from Salesforce sources
`sf2loki_decode_errors`	Counter	(none)	Total Avro/payload decode errors
`sf2loki_loki_push`	Counter	(none)	Total Loki push attempts
`sf2loki_loki_entries_dropped`	Counter	(none)	Loki entries dropped as undeliverable (permanent errors), per reason
`sf2loki_loki_push_duration_seconds`	Histogram	(none)	Duration of Loki push requests in seconds
`sf2loki_loki_bytes_pushed`	Counter	(none)	Total bytes pushed to Loki
`sf2loki_last_push_success_timestamp_seconds`	Gauge	(none)	Unix ts of the last successful Loki push
`sf2loki_ingest_lag_seconds`	Histogram	per event type	`now - event EventDate` (the ingest-lag SLI). Bucket boundaries extend to 24h (`1, 5, 15, 60, 300, 900, 1800, 3600, 7200, 14400, 28800, 86400`) because EventLogFile lag is legitimately 3-6h
`sf2loki_last_replay_commit_timestamp_seconds`	Gauge	per topic	Unix ts of the last committed replay_id
`sf2loki_pubsub_pending_credits`	Gauge	per topic	Pending Pub/Sub flow-control credits
`sf2loki_pubsub_reconnects`	Counter	per topic	Total Pub/Sub stream reconnects
`sf2loki_pubsub_replay_fallbacks`	Counter	per topic	Subscriptions restarted with a fallback replay preset after an invalid/expired replay id
`sf2loki_pubsub_stream_stalls`	Counter	per topic	Pub/Sub streams force-reconnected by the keepalive watchdog
`sf2loki_salesforce_api_throttled`	Counter	per api	Salesforce REST calls rejected with `REQUEST_LIMIT_EXCEEDED`
`sf2loki_pubsub_stream_up`	Gauge	per topic	1 while a topic's subscribe stream is connected and healthy, 0 while erroring/reconnecting
`sf2loki_watermark_timestamp_seconds`	Gauge	per source/object	Unix ts of the current polling watermark
`sf2loki_watermark_stalls`	Counter	per source/object	Poll cycles that made no progress because a full page shared one timestamp boundary (a compound-cursor stall would silently drop newer data)
`sf2loki_pubsub_decode_stalls`	Counter	per topic	Topics stuck making zero progress — consecutive batches at 100% decode failure, or repeated schema-fetch failures
`sf2loki_checkpoint_load_errors`	Counter	per source	Stored checkpoint values that failed to load/decode and fell back to a replay preset / lookback
`sf2loki_auth_refreshes`	Counter	(none)	Total Salesforce OAuth token refreshes
`sf2loki_auth_errors`	Counter	(none)	Total Salesforce auth errors
`sf2loki_schema_cache_size`	Gauge	(none)	Avro schemas in the codec cache
`sf2loki_queue_depth`	Gauge	(none)	Depth of the internal event queue
`sf2loki_eventlogfile_files_processed`	Counter	per event type	EventLogFile records downloaded and parsed
`sf2loki_eventlogfile_rows_ingested`	Counter	per event type	CSV rows ingested from EventLogFiles
`sf2loki_eventlogfile_download_bytes`	Counter	(none)	Bytes downloaded from EventLogFile LogFile endpoints
`sf2loki_eventlogfile_download_errors`	Counter	per reason	EventLogFile listing/download errors
`sf2loki_soql_poll_errors`	Counter	per source and object/event type	Failed SOQL poll cycles
`sf2loki_apexlog_logs_ingested`	Counter	(none)	ApexLog debug logs ingested
`sf2loki_apexlog_download_bytes`	Counter	(none)	Bytes downloaded from ApexLog Body endpoints
`sf2loki_apexlog_bodies_skipped`	Counter	per reason	ApexLog bodies not shipped (over `max_body_bytes` or download error)
`sf2loki_apexlog_download_errors`	Counter	per reason	ApexLog body download errors
`sf2loki_timestamp_fallbacks`	Counter	per source	Entries whose event timestamp was missing/unparseable and a fallback was used
`sf2loki_lines_truncated`	Counter	per source	Log lines truncated to `max_line_bytes`
`sf2loki_entries_sampled_out`	Counter	per source and event type	Rows/events dropped by deterministic per-type sampling
`sf2loki_rows_filtered`	Counter	per source and rule	Rows dropped by transform `drop_row` rules
`sf2loki_egress_budget_used_bytes`	Gauge	(none)	Pre-compression bytes counted against the daily egress budget (current UTC day)
`sf2loki_egress_paused`	Gauge	(none)	1 while pushes are paused by the daily byte budget, else 0
`sf2loki_eventlogfile_cycle_seconds`	Gauge	(none)	Wall-clock duration of the last EventLogFile poll cycle
`sf2loki_leader`	Gauge	(none)	1 while this instance holds leadership (or runs standalone), else 0
`sf2loki_salesforce_limit_max`	Gauge	per limit_name	Salesforce org limit maximum (requires `salesforce.limits.enabled: true`)
`sf2loki_salesforce_limit_remaining`	Gauge	per limit_name	Salesforce org limit remaining (requires `salesforce.limits.enabled: true`)
`sf2loki_salesforce_limits_poll_errors`	Counter	(none)	Total Salesforce `/limits` poll errors
`sf2loki_build_info`	Gauge	`version`	Build metadata; value is always 1

All labels above are in addition to the org label added under multi-org ingestion (see above); "(none)" means no labels other than a possible org.
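
To make the `sf2loki_ingest_lag_seconds` boundaries concrete, the sketch below finds the smallest bucket upper bound that contains a given lag, using the boundaries listed in the table. It assumes the usual upper-inclusive histogram convention (a lag of exactly 60 s lands in the 60 s bucket); the function name `lag_bucket` is illustrative and not part of the connector.

```python
from bisect import bisect_left

# Bucket upper bounds in seconds, as listed for sf2loki_ingest_lag_seconds.
LAG_BUCKETS_SECONDS = [1, 5, 15, 60, 300, 900, 1800, 3600, 7200, 14400, 28800, 86400]


def lag_bucket(lag_seconds: float) -> float:
    """Return the smallest upper bound that contains the lag, or +Inf past 24h."""
    index = bisect_left(LAG_BUCKETS_SECONDS, lag_seconds)
    if index == len(LAG_BUCKETS_SECONDS):
        return float("inf")
    return float(LAG_BUCKETS_SECONDS[index])


three_hours = lag_bucket(3 * 3600)   # EventLogFile lag at the low end
six_hours = lag_bucket(6 * 3600)     # EventLogFile lag at the high end
over_a_day = lag_bucket(100_000)     # beyond the last finite boundary
```

Under these assumptions a 3 h lag falls in the 14400 s bucket and a 6 h lag in the 28800 s bucket, which is why the boundaries need to extend well past one hour.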

Where these are used

Dashboards — the bundled sf2loki-connector-health dashboard graphs these metrics (suffixed) from a Prometheus datasource.
Alerts — the Grafana-managed alert pack thresholds several of these (ingest lag, Loki push failure rate, org-limit headroom).
State & checkpoints — sf2loki_watermark_timestamp_seconds, sf2loki_last_replay_commit_timestamp_seconds, and sf2loki_checkpoint_load_errors are the checkpoint-health signals.
High availability — sf2loki_leader reflects lease ownership under the active-passive coordinator.
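
As a rough illustration of how the timestamp and limit gauges can be turned into alertable values, the sketch below derives "seconds since last successful push" from `sf2loki_last_push_success_timestamp_seconds` and a headroom fraction from `sf2loki_salesforce_limit_remaining` and `sf2loki_salesforce_limit_max`. The function names and sample numbers are assumptions for illustration; the thresholds the alert pack actually uses are not stated on this page.

```python
import time
from typing import Optional


def seconds_since(timestamp_seconds: float, now: Optional[float] = None) -> float:
    """Age of a Unix-timestamp gauge, e.g. the last successful Loki push."""
    current = time.time() if now is None else now
    return current - timestamp_seconds


def limit_headroom(remaining: float, maximum: float) -> float:
    """Fraction of an org limit still available; 0.0 when the max is not positive."""
    if maximum <= 0:
        return 0.0
    return remaining / maximum


# Sample values only: a push 90 s ago and a limit with a quarter left.
push_age = seconds_since(1_000.0, now=1_090.0)
headroom = limit_headroom(2_500, 10_000)
```

The same arithmetic applies to the other timestamp gauges, such as `sf2loki_watermark_timestamp_seconds` and `sf2loki_last_replay_commit_timestamp_seconds`.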