Alerts¶
sf2loki ships a hand-authored Grafana-managed alert + recording rule
pack under
deploy/grafana/rules/
(rules.alerting.grafana.app/v0alpha1) — recording/ and alerting/, one
resource per file. There is no generator; edit the YAML directly.
Only severity and service labels are set on every rule — routing to a
contact point is left to your notification policy.
Recording rules¶
These evaluate a LogQL query against Loki every 60s and record the result as
a Prometheus series (via targetDatasourceUID: grafanacloud-prom), so
dashboards and alerts can read a cheap metric instead of re-scanning logs.
| Rule | Recorded metric | Source query |
|---|---|---|
sf2loki-rec-login-failures-5m |
sf2loki_login_failures:count5m |
Failed Login events (LOGIN_STATUS ≠ LOGIN_NO_ERROR), 5m |
sf2loki-rec-apex-callout-errors-5m |
sf2loki_apex_callout_errors:count5m |
ApexCallout events with SUCCESS="0", 5m |
sf2loki-rec-events-5m |
sf2loki_events:count5m |
All events, by (source, event_type), 5m |
sf2loki-rec-api-usage-5m |
sf2loki_api_usage:count5m |
ApiTotalUsage events, by (API_FAMILY), 5m |
Alert rules¶
| Rule | Severity | Signal | Datasource |
|---|---|---|---|
sf2loki-login-failure-spike |
warning | More than 10 failed Salesforce logins in the last 10m | Loki (grafanacloud-logs) |
sf2loki-apex-callout-error-rate |
warning | ApexCallout error rate above 10% over the last 10m |
Loki (grafanacloud-logs) |
sf2loki-api-limit-low |
critical | Lowest Salesforce org-limit headroom below 10% (sf2loki_salesforce_limit_remaining / sf2loki_salesforce_limit_max) |
Prometheus/OTLP (grafanacloud-prom) |
sf2loki-ingest-lag-high |
warning | p95 ingest lag above 15m (900s), sustained 10m (sf2loki_ingest_lag_seconds_bucket) |
Prometheus/OTLP (grafanacloud-prom) |
sf2loki-loki-push-failing |
critical | Loki push failure rate above 5% over 5m (sf2loki_loki_push_total) |
Prometheus/OTLP (grafanacloud-prom) |
sf2loki-no-recent-push |
critical | No successful Loki push in the last 10m (sf2loki_last_push_success_timestamp_seconds) |
Prometheus/OTLP (grafanacloud-prom) |
sf2loki-leader-anomaly |
critical | Active-leader count sum(sf2loki_leader) not exactly 1 — 0 = leaderless gap, 2+ = split-brain (sf2loki_leader) |
Prometheus/OTLP (grafanacloud-prom) |
sf2loki-login-failure-spike and sf2loki-apex-callout-error-rate query
Loki directly; the other connector-health alerts read the metrics
documented in Metrics, via the companion
sf2loki-connector-health.json
dashboard.
Connector-metric alerts need suffixed names + add_metric_suffixes
sf2loki-api-limit-low, sf2loki-ingest-lag-high, sf2loki-loki-push-failing,
and sf2loki-no-recent-push query the OpenTelemetry→Prometheus suffixed
metric names (sf2loki_loki_push_total, sf2loki_ingest_lag_seconds_bucket, …).
If you route metrics through your own Collector or Grafana Alloy instead of
Grafana Cloud's OTLP endpoint, add_metric_suffixes must stay enabled or
these rules go permanently NoData (mapped to Ok by noDataState: Ok,
so they fail silently rather than firing). See
Metric-name suffixes.
Datasource UIDs¶
Grafana-managed rules can't template datasources, so every rule embeds a
UID directly: grafanacloud-logs for Loki, grafanacloud-prom for
Prometheus — the Grafana Cloud defaults. On self-hosted Grafana, replace
both UIDs in each YAML file with your own before pushing.
Applying with gcx¶
Editing¶
Copy an existing file, change metadata.name and spec.title, and adjust
the expressions map — the alerting condition is the leaf threshold
expression (C) evaluated against the query result (A). Keep the
datasource-UID and metric-suffix caveats above in mind. Re-validate and push
after any change.
When an alert fires¶
If a checkpoint is the poison record blocking the pipeline queue (a likely
cause behind sf2loki-no-recent-push or sf2loki-ingest-lag-high), see
State & checkpoints for how to inspect and advance
it with sf2loki state. For active-passive deployments, sf2loki_leader
(see Metrics) shows which instance currently holds the lease —
check High availability if pushes stop
after a failover.