
ai-sre-incident-response

npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/ai-sre-incident-response --openclaw
Files (1): SKILL.md (2.1 KB)

AI SRE Incident Response

Apply SRE rigor to AI systems where incidents include quality regressions, unsafe outputs, and budget explosions.

AI Incident Classes

  • Availability incident: model/provider unavailable, timeout storm.
  • Quality incident: answer accuracy or tool success drops below SLO.
  • Safety incident: harmful or policy-violating outputs increase.
  • Cost incident: unexpected token or provider spend spike.
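
As a sketch, these four classes could be encoded as an enum with a rule-based triage function. The alert field names and thresholds below are illustrative assumptions, not part of the skill:

```python
from enum import Enum

class IncidentClass(Enum):
    AVAILABILITY = "availability"  # provider down, timeout storm
    QUALITY = "quality"            # accuracy/tool success below SLO
    SAFETY = "safety"              # rise in policy-violating outputs
    COST = "cost"                  # unexpected spend spike

def classify(event: dict) -> IncidentClass:
    """Map a raw alert payload to an incident class.
    Field names and thresholds are hypothetical examples."""
    if event.get("provider_unreachable") or event.get("timeout_rate", 0) > 0.2:
        return IncidentClass.AVAILABILITY
    if event.get("guardrail_violation_rate", 0) > 0.01:
        return IncidentClass.SAFETY
    if event.get("spend_per_min_usd", 0) > event.get("spend_budget_usd", float("inf")):
        return IncidentClass.COST
    return IncidentClass.QUALITY  # fall through: degraded accuracy/tool success
```

The availability and safety checks run first because they usually demand the fastest containment; your ordering and thresholds will differ.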

Severity Framework (Example)

  • SEV1: user-facing outage, critical compliance risk, or active data leak.
  • SEV2: major degradation affecting key flows.
  • SEV3: limited impact or internal-only issue.
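
A minimal mapping of this example framework; the boolean inputs are hypothetical signals you would derive from your own monitoring:

```python
def assign_severity(user_facing_outage: bool, data_leak: bool,
                    key_flow_degraded: bool) -> str:
    """Illustrative mapping of the example severity framework.
    SEV1 conditions are checked first so they always win."""
    if user_facing_outage or data_leak:
        return "SEV1"  # outage, compliance risk, or active data leak
    if key_flow_degraded:
        return "SEV2"  # major degradation of key flows
    return "SEV3"      # limited or internal-only impact
```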

Golden Signals for AI Services

  • Request success rate
  • Latency (queue + generation + tool execution)
  • Hallucination/groundedness proxy metrics
  • Cost per minute and per tenant
  • Guardrail violation rate
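
One way to represent these signals is a snapshot dataclass checked against SLO thresholds. The threshold values below are placeholders, not recommended SLOs:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    success_rate: float            # fraction of requests completing OK
    p95_latency_s: float           # queue + generation + tool execution
    groundedness_score: float      # hallucination proxy, 1.0 = fully grounded
    cost_per_min_usd: float
    guardrail_violation_rate: float

def breaches(s: GoldenSignals) -> list[str]:
    """Return the signals breaching example thresholds (placeholders)."""
    out = []
    if s.success_rate < 0.99:
        out.append("success_rate")
    if s.p95_latency_s > 10.0:
        out.append("latency")
    if s.groundedness_score < 0.9:
        out.append("groundedness")
    if s.cost_per_min_usd > 5.0:
        out.append("cost")
    if s.guardrail_violation_rate > 0.005:
        out.append("guardrails")
    return out
```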

Response Playbooks

Model Outage

  1. Freeze deployments.
  2. Shift traffic to fallback model/provider.
  3. Enforce stricter rate limits.
  4. Communicate ETA and mitigation.
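
Step 2 (shifting traffic to a fallback) can be sketched as ordered failover; `call` stands in for whatever LLM client you actually use:

```python
def route_request(prompt: str, providers: list, call) -> str:
    """Try providers in priority order, falling back on any failure.
    `call(provider, prompt)` is a stand-in for a real LLM client."""
    last_err = None
    for provider in providers:
        try:
            return call(provider, prompt)
        except Exception as exc:  # timeout, 5xx, connection error, etc.
            last_err = exc
    raise RuntimeError("all providers failed") from last_err
```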

Quality Regression

  1. Roll back prompt/model version.
  2. Disable risky optimization flags.
  3. Increase sampling for trace review.
  4. Re-run latest eval baseline.
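
Step 4 (re-running the eval baseline) might look like the sketch below; `run_model`, the eval-case format, and the tolerance value are all assumptions:

```python
def regression_check(eval_cases, run_model, baseline_accuracy, tolerance=0.02):
    """Re-run an eval set after a rollback and compare to baseline.
    Returns (accuracy, passed); `run_model` is a stand-in callable."""
    correct = sum(1 for case in eval_cases
                  if run_model(case["input"]) == case["expected"])
    accuracy = correct / len(eval_cases)
    return accuracy, accuracy >= baseline_accuracy - tolerance
```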

Cost Spike

  1. Identify top tenants/routes/models.
  2. Enable cache + cheaper fallback path.
  3. Apply temporary token caps.
  4. Open postmortem with prevention actions.
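
Step 1 (identifying top spenders) is a simple aggregation over usage records; the record fields here are illustrative:

```python
from collections import defaultdict

def top_spenders(records, n=3):
    """Aggregate spend per tenant and return the n largest.
    Each record is assumed to carry `tenant` and `cost_usd` fields."""
    totals = defaultdict(float)
    for r in records:
        totals[r["tenant"]] += r["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

The same grouping works per route or per model by swapping the key.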

Postmortem Requirements

  • Timeline with detector and responder timestamps
  • Blast radius by tenant and feature
  • Missed signals and alert tuning actions
  • Concrete hardening tasks with owners and due dates
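
These requirements can be captured in a template that refuses to close an incomplete postmortem; a minimal sketch with assumed field shapes:

```python
from dataclasses import dataclass

@dataclass
class Postmortem:
    timeline: list        # (timestamp, actor, event) tuples
    blast_radius: dict    # tenant/feature -> impact description
    missed_signals: list  # alerts that should have fired, tuning actions
    hardening_tasks: list # (task, owner, due_date) tuples

    def is_complete(self) -> bool:
        """All four sections must be non-empty, and every hardening
        task needs both an owner and a due date."""
        return (bool(self.timeline)
                and bool(self.blast_radius)
                and bool(self.missed_signals)
                and bool(self.hardening_tasks)
                and all(owner and due for _, owner, due in self.hardening_tasks))
```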

Source

git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills.git

The skill definition lives at devops/ai/ai-sre-incident-response/SKILL.md in the repository.

Overview

This skill applies SRE rigor to AI systems, detecting and responding to outages, quality regressions, unsafe outputs, and budget spikes. It defines AI incident classes (availability, quality, safety, cost), a severity framework, and golden signals to guide rapid containment and postmortems.

How This Skill Works

The skill triages AI incidents with a structured taxonomy and metrics: events are classified as availability, quality, safety, or cost, and assigned a severity from SEV1 to SEV3. Golden signals such as request success rate, latency, hallucination proxy, cost per minute, and guardrail violation rate drive predefined playbooks for containment, diagnosis, and recovery, plus postmortem templates.

When to Use It

  • LLM outages or provider unavailability causing user-visible failures
  • Quality regression where answer accuracy or tool success falls below SLO
  • Safety incidents where harmful or policy-violating outputs increase
  • Cost spikes or runaway spend from tokens, calls, or tools that require quick containment and escalation

Quick Start

  1. Map AI incidents to availability, quality, safety, and cost; identify the golden signals to monitor.
  2. Implement the SEV1–SEV3 severity framework and predefine response playbooks for model outage, quality regression, and cost spike.
  3. Establish a postmortem template and run drills to validate detection, containment, and hardening tasks with owners and due dates.

Best Practices

  • Define a consistent AI incident taxonomy (availability, quality, safety, cost) and assign an owner to each class
  • Track golden signals: request success rate, latency, hallucination proxy, cost per minute/tenant, guardrail violation rate
  • Enforce strict deployment controls during incidents: freeze deployments, shift traffic to fallback model/provider, apply stricter rate limits
  • Document postmortems with timeline, blast radius, missed signals, alert tuning actions, and hardening tasks with owners and due dates
  • Regularly exercise playbooks and keep runbooks and eval baselines up to date

Example Use Cases

  • Model Outage: provider becomes unavailable; freeze deployments, shift to fallback, communicate ETA
  • Quality Regression: a prompt or model update lowers answer accuracy; roll back the prompt/model version and re-run the baseline evaluation
  • Safety Spike: guardrails fail and outputs become policy-violating; disable risky optimizations and review prompts
  • Cost Spike: sudden token spend surge; identify top tenants, enable cache and cheaper fallbacks
  • Guardrail Violation Spike: spike in unsafe outputs; rapid triage and containment plus remediation
