Skip to content

Incidents & Scenarios

The incident model separates what can happen (the failure-mode vocabulary declared by each construct/workload) from when it happens (activation via a scheduled incidents: block or a live control-plane call).

A failure mode is not a synthetic error injection in a naive sense — it shifts the shape engine's multiplier for specific constructs so that the affected signal families brown out realistically: latency histograms right-tail, error counters climb, pod restarts increase, connection gauges approach max. The rest of the estate stays healthy.

Scenarios: named failure bundles

A scenarios: block defines reusable named bundles of effects. Each effect declares a mode, a target instance, and an intensity in [0, 1]:

scenarios:
  - name: database_overload
    title: Database overload storm
    summary: >
      Aurora connections saturate and query latency spikes;
      the API backend browns out.
    effects:
      - { mode: connection_saturation, target: mine-app-db,  intensity: 0.9 }
      - { mode: slow_query_storm,      target: mine-app-db,  intensity: 0.7 }
      - { mode: latency_spike,         target: mine-api,     intensity: 0.5 }

  - name: cluster_instability
    title: Cluster pod crash-loop
    effects:
      - { mode: pod_crashloop, target: mine-prod-use1, intensity: 0.4 }
      - { mode: oom_kill,      target: mine-prod-use1, intensity: 0.3 }

Effect target addressing

The target field on an effect resolves to a specific declared instance:

Target form Meaning
mine-app-db Exact instance name — the named database, cluster, workload, or service node
database:* All database instances on the database axis
cluster:* All k8s cluster instances
workload:* All workload instances
service:* All app service nodes across all app workloads
cloud:* All cloud-scoped constructs (Bedrock, AgentCore, Portkey, etc.)
network:* All network-topology instances
omitted Valid only for single-axis modes — the mode's axis is inferred and all instances of that axis are targeted

The resolver validates every effect target against the blueprint's actual instance inventory at load time. An unknown name or an axis mismatch is a loud load error.

A mode that appears on more than one axis (e.g. lock_contention exists on both database and service) requires an explicit target — the empty-target shorthand is rejected at load to avoid ambiguous targeting.

Incidents: scheduled activations

An incidents: block schedules when a scenario or single-mode effect fires:

incidents:
  - scenario: database_overload
    at: "2026-07-01T14:00"   # ISO-8601 wall time (local to the blueprint's timezone)
    for: 30m
incidents:
  - kind: pod_crashloop
    target: mine-prod-use1
    at: "2026-07-01T11:00"
    for: 15m
    intensity: 0.6
incidents:
  # Non-prod ambient churn: fires every 30 minutes, lasts 8 minutes.
  - kind: oom_kill
    target: mine-dev-use1
    every: 30m
    for: 8m
    intensity: 0.15

scenario: and kind: are mutually exclusive in one entry, and an entry needs exactly one of at: or every:. at: accepts either a full wall-clock timestamp (fires once) or a bare HH:MM (fires daily at that time); every: fires on the given interval continuously for the lifetime of the process.

intensity in [0, 1]: 0 is a no-op; 1 is the maximum effect the construct physics implement. Fractional intensities affect a proportional fraction of pods or metric series.

Available failure modes

cluster axis

Mode Effect
oom_kill Containers OOM-killed; restart count climbs, status reason OOMKilled. intensity selects the fraction of pods affected.
pod_crashloop Pods crash-looping; restarts climb, phase Pending not Running. intensity selects fraction of pods.
node_not_ready A node flips NotReady; its pods go Pending.

database axis

Mode Effect
connection_saturation Active connections climb toward max.
replication_lag Replica falls behind primary.
lock_contention Lock waits climb. (Also fires the query_data_locks dbo11y op when active.)
slow_query_storm Query latency right-tail spikes.

workload axis (web_service)

Mode Effect
latency_spike Elevated request latency (up to 4× at full intensity).
error_burst Elevated 5xx error rate.
cpu_hotspot CPU concentrated in a hot frame (profiling flamegraph).
memory_leak Growing heap (profile sample values rise).
lock_contention Elevated mutex/block contention (profile values).
goroutine_leak Goroutine accumulation (profile values).

service axis (app workload — per-node)

Mode Effect
latency_storm Elevated latency on the targeted service node.
error_spike Elevated 5xx rate on the targeted service node.
throughput_drop Reduced throughput on the targeted service node.
fallback_storm Elevated gateway fallback rate on the targeted service node.
retry_storm Elevated gateway retry rate on the targeted service node.
cpu_hotspot Hot frame amplification on the targeted node (profiling).
memory_leak Growing heap on the targeted node (profiling).
lock_contention Mutex/block contention on the targeted node (profiling).
goroutine_leak Goroutine accumulation on the targeted node (profiling).
web_vitals_degraded Browser web-vitals degrade on the targeted frontend node — LCP/INP/TTFB/FCP/CLS spike.

cloud axis

Mode Effect
bedrock_throttle Bedrock invocation throttling climbs.
agentcore_throttle AgentCore request throttles + system_errors spike (region-scoped capacity constraint).
portkey_scrape_degraded Portkey analytics scrape degrades — API error_rate and 4xx/5xx share climb, latency rises, poller falls behind.
eval_quality_degraded LangSmith eval quality regresses — faithfulness/completeness/relevance and retrieval scores drop while retry/fallback/HITL rates and error/pending run-outcomes climb.

network axis

Mode Effect
nettopo_devices_unreachable SNMP polling fails for a fraction of devices (walk errors spike, device discovery drops).
nettopo_discovery_slow Discovery cycle duration inflates (cycle_duration_seconds and module walk times rise).
nettopo_walker_degraded Walker outcome errors climb; edge count under-reports (partial topology visibility).
nettopo_auth_failures SNMP credential trials fail (credential_trials_total error rate rises).
nettopo_spoke_down A federation spoke goes offline (network_topology_federation_spoke_up drops to 0, hub/spoke session metrics degrade).

Definition vs activation

The scenarios: and incidents: blocks are the definition layer: they describe what modes exist in this blueprint and when they are scheduled to fire. The runner validates these at load time against the actual construct vocabulary and target inventory.

Activation is separate:

  • The scheduled incidents: entries fire automatically according to their at: or every: schedule while the process runs.
  • Live activation via the control plane is additive: it unions on top of any currently scheduled windows. A scenario activated live runs until explicitly deactivated, regardless of the incidents: schedule.

For live activation, see Control Plane. The control plane also exposes GET /control/schema, which returns the complete derived vocabulary — modes, addressable targets with current scaling state, and all named scenarios — for the loaded blueprints.

Complete example

name: mine
scenarios:
  - name: db_storm
    title: Database connection storm
    effects:
      - { mode: connection_saturation, target: mine-app-db, intensity: 0.8 }
      - { mode: slow_query_storm,      target: mine-app-db, intensity: 0.6 }
      - { mode: latency_spike,         target: mine-api,    intensity: 0.4 }

  - name: ai_brownout
    title: AI gateway brownout
    effects:
      - { mode: agentcore_throttle,                    intensity: 0.7 }
      - { mode: retry_storm, target: mine-api-backend, intensity: 0.5 }

incidents:
  # One-shot scheduled event
  - scenario: db_storm
    at: "2026-07-15T14:00"
    for: 25m

  # Ambient non-prod churn
  - kind: oom_kill
    target: mine-dev-eks
    every: 20m
    for: 5m
    intensity: 0.2

  # Single-mode with explicit target
  - kind: pod_crashloop
    target: mine-dev-eks
    at: "2026-07-15T10:05"
    for: 12m
    intensity: 0.5