
ai-sre-incident-response

npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/ai-sre-incident-response --openclaw
Files (1): SKILL.md (2.1 KB)

AI SRE Incident Response

Apply SRE rigor to AI systems where incidents include quality regressions, unsafe outputs, and budget explosions.

AI Incident Classes

  • Availability incident: model/provider unavailable, timeout storm.
  • Quality incident: answer accuracy or tool success drops below SLO.
  • Safety incident: harmful or policy-violating outputs increase.
  • Cost incident: unexpected token or provider spend spike.
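
As a sketch, these four classes could be encoded as an enum with a rule-based triage function. The alert field names and thresholds below are illustrative assumptions, not part of the skill:

```python
from enum import Enum

class IncidentClass(Enum):
    AVAILABILITY = "availability"  # provider down, timeout storm
    QUALITY = "quality"            # accuracy/tool success below SLO
    SAFETY = "safety"              # rise in policy-violating outputs
    COST = "cost"                  # unexpected spend spike

def classify(event: dict) -> IncidentClass:
    """Map a raw alert payload to an incident class.
    Field names and thresholds are hypothetical examples."""
    if event.get("provider_unreachable") or event.get("timeout_rate", 0) > 0.2:
        return IncidentClass.AVAILABILITY
    if event.get("guardrail_violation_rate", 0) > 0.01:
        return IncidentClass.SAFETY
    if event.get("spend_per_min_usd", 0) > event.get("spend_budget_usd", float("inf")):
        return IncidentClass.COST
    return IncidentClass.QUALITY  # fall through: degraded accuracy/tool success
```

The availability and safety checks run first because they usually demand the fastest containment; your ordering and thresholds will differ.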

Severity Framework (Example)

  • SEV1: user-facing outage, critical compliance risk, or active data leak.
  • SEV2: major degradation affecting key flows.
  • SEV3: limited impact or internal-only issue.
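
A minimal mapping of this example framework; the boolean inputs are hypothetical signals you would derive from your own monitoring:

```python
def assign_severity(user_facing_outage: bool, data_leak: bool,
                    key_flow_degraded: bool) -> str:
    """Illustrative mapping of the example severity framework.
    SEV1 conditions are checked first so they always win."""
    if user_facing_outage or data_leak:
        return "SEV1"  # outage, compliance risk, or active data leak
    if key_flow_degraded:
        return "SEV2"  # major degradation of key flows
    return "SEV3"      # limited or internal-only impact
```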

Golden Signals for AI Services

  • Request success rate
  • Latency (queue + generation + tool execution)
  • Hallucination/groundedness proxy metrics
  • Cost per minute and per tenant
  • Guardrail violation rate
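
One way to represent these signals is a snapshot dataclass checked against SLO thresholds. The threshold values below are placeholders, not recommended SLOs:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    success_rate: float            # fraction of requests completing OK
    p95_latency_s: float           # queue + generation + tool execution
    groundedness_score: float      # hallucination proxy, 1.0 = fully grounded
    cost_per_min_usd: float
    guardrail_violation_rate: float

def breaches(s: GoldenSignals) -> list[str]:
    """Return the signals breaching example thresholds (placeholders)."""
    out = []
    if s.success_rate < 0.99:
        out.append("success_rate")
    if s.p95_latency_s > 10.0:
        out.append("latency")
    if s.groundedness_score < 0.9:
        out.append("groundedness")
    if s.cost_per_min_usd > 5.0:
        out.append("cost")
    if s.guardrail_violation_rate > 0.005:
        out.append("guardrails")
    return out
```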

Response Playbooks

Model Outage

  1. Freeze deployments.
  2. Shift traffic to fallback model/provider.
  3. Enforce stricter rate limits.
  4. Communicate ETA and mitigation.
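
Step 2 (shifting traffic to a fallback) can be sketched as ordered failover; `call` stands in for whatever LLM client you actually use:

```python
def route_request(prompt: str, providers: list, call) -> str:
    """Try providers in priority order, falling back on any failure.
    `call(provider, prompt)` is a stand-in for a real LLM client."""
    last_err = None
    for provider in providers:
        try:
            return call(provider, prompt)
        except Exception as exc:  # timeout, 5xx, connection error, etc.
            last_err = exc
    raise RuntimeError("all providers failed") from last_err
```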

Quality Regression

  1. Roll back prompt/model version.
  2. Disable risky optimization flags.
  3. Increase sampling for trace review.
  4. Re-run latest eval baseline.
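
Step 4 (re-running the eval baseline) might look like the sketch below; `run_model`, the eval-case format, and the tolerance value are all assumptions:

```python
def regression_check(eval_cases, run_model, baseline_accuracy, tolerance=0.02):
    """Re-run an eval set after a rollback and compare to baseline.
    Returns (accuracy, passed); `run_model` is a stand-in callable."""
    correct = sum(1 for case in eval_cases
                  if run_model(case["input"]) == case["expected"])
    accuracy = correct / len(eval_cases)
    return accuracy, accuracy >= baseline_accuracy - tolerance
```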

Cost Spike

  1. Identify top tenants/routes/models.
  2. Enable cache + cheaper fallback path.
  3. Apply temporary token caps.
  4. Open postmortem with prevention actions.
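
Step 1 (identifying top spenders) is a simple aggregation over usage records; the record fields here are illustrative:

```python
from collections import defaultdict

def top_spenders(records, n=3):
    """Aggregate spend per tenant and return the n largest.
    Each record is assumed to carry `tenant` and `cost_usd` fields."""
    totals = defaultdict(float)
    for r in records:
        totals[r["tenant"]] += r["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The same grouping works per route or per model by swapping the key.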

Postmortem Requirements

  • Timeline with detector and responder timestamps
  • Blast radius by tenant and feature
  • Missed signals and alert tuning actions
  • Concrete hardening tasks with owners and due dates
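
These requirements can be captured in a template that refuses to close an incomplete postmortem; a minimal sketch with assumed field shapes:

```python
from dataclasses import dataclass

@dataclass
class Postmortem:
    timeline: list        # (timestamp, actor, event) tuples
    blast_radius: dict    # tenant/feature -> impact description
    missed_signals: list  # alerts that should have fired, tuning actions
    hardening_tasks: list # (task, owner, due_date) tuples

    def is_complete(self) -> bool:
        """All four sections must be non-empty, and every hardening
        task needs both an owner and a due date."""
        return (bool(self.timeline)
                and bool(self.blast_radius)
                and bool(self.missed_signals)
                and bool(self.hardening_tasks)
                and all(owner and due for _, owner, due in self.hardening_tasks))
```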

Source

git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills.git

The skill definition lives at devops/ai/ai-sre-incident-response/SKILL.md in the repository.

Overview

This skill applies SRE rigor to AI systems, detecting and responding to outages, quality regressions, unsafe outputs, and budget spikes. It defines AI incident classes (availability, quality, safety, cost), a severity framework, and golden signals to guide rapid containment and postmortems.

How This Skill Works

The skill triages AI incidents with a structured taxonomy and metrics: events are classified as availability, quality, safety, or cost, and assigned a severity from SEV1 to SEV3. Golden signals such as request success rate, latency, hallucination proxy, cost per minute, and guardrail violation rate drive predefined playbooks for containment, diagnosis, and recovery, plus postmortem templates.

When to Use It

  • LLM outages or provider unavailability causing user-visible failures
  • Quality regression where answer accuracy or tool success falls below SLO
  • Safety incidents where harmful or policy-violating outputs increase
  • Cost spikes or runaway spend from tokens, calls, or tools that require quick containment and escalation

Quick Start

  1. Map AI incidents to availability, quality, safety, and cost; identify the golden signals to monitor.
  2. Implement the SEV1–SEV3 severity framework and predefine response playbooks for model outage, quality regression, and cost spike.
  3. Establish a postmortem template and run drills to validate detection, containment, and hardening tasks with owners and due dates.

Best Practices

  • Define a consistent AI incident taxonomy (availability, quality, safety, cost) and assign an owner to each class
  • Track golden signals: request success rate, latency, hallucination proxy, cost per minute/tenant, guardrail violation rate
  • Enforce strict deployment controls during incidents: freeze deployments, shift traffic to fallback model/provider, apply stricter rate limits
  • Document postmortems with timeline, blast radius, missed signals, alert tuning actions, and hardening tasks with owners and due dates
  • Regularly exercise playbooks and keep runbooks and eval baselines up to date

Example Use Cases

  • Model Outage: provider becomes unavailable; freeze deployments, shift to fallback, communicate ETA
  • Quality Regression: a prompt or model update lowers answer accuracy; roll back the prompt/model version and re-run the baseline evaluation
  • Safety Spike: guardrails fail and outputs become policy-violating; disable risky optimizations and review prompts
  • Cost Spike: sudden token spend surge; identify top tenants, enable cache and cheaper fallbacks
  • Guardrail Violation Spike: spike in unsafe outputs; rapid triage and containment plus remediation
