agent-observability
npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/agent-observability --openclaw
Agent Observability
Monitor AI agent behavior with logs, traces, metrics, and cost telemetry.
Track Core Signals
- Request latency (p50/p95/p99)
- Token usage (prompt/completion/cached)
- Tool call success and failure rates
- Cost per task and per customer
- Hallucination and retry frequency
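Two of these signals can be computed directly from raw samples. A minimal sketch, assuming latency samples in milliseconds and illustrative per-1K-token rates (the rates are placeholders, not real model pricing):

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a window of request latencies in ms."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def cost_per_task(prompt_tokens, completion_tokens, cached_tokens,
                  prompt_rate=0.01, completion_rate=0.03, cached_rate=0.001):
    """Token cost for one task; rates are illustrative $ per 1K tokens."""
    return (prompt_tokens * prompt_rate
            + completion_tokens * completion_rate
            + cached_tokens * cached_rate) / 1000.0
```

Cached prompt tokens are billed separately because most providers discount them, which is why the signal list tracks prompt/completion/cached as distinct counters.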
Implementation Pattern
- Add trace IDs to every user request.
- Capture each LLM call and tool call as child spans.
- Emit structured logs with model, temperature, and response status.
- Create SLOs for success rate and median response time.
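The pattern above can be sketched without committing to a tracing library; a real deployment would typically use OpenTelemetry, but the shape of the data is the same. Span names, the `EXPORTED` buffer, and the model attributes below are illustrative:

```python
import json
import time
import uuid
from contextlib import contextmanager

EXPORTED = []  # stand-in for a real trace/log exporter

@contextmanager
def span(trace_id, name, parent_id=None, **attrs):
    """Record one unit of work (an LLM call, a tool call) as a child span."""
    span_id = uuid.uuid4().hex[:16]
    start = time.monotonic()
    status = "ok"
    try:
        yield span_id
    except Exception:
        status = "error"
        raise
    finally:
        # Structured log: trace linkage plus model/config attributes.
        EXPORTED.append(json.dumps({
            "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
            "name": name, "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
            **attrs,
        }))

# One trace ID per user request; every LLM/tool call becomes a child span.
trace_id = uuid.uuid4().hex
with span(trace_id, "handle_request") as root:
    with span(trace_id, "llm_call", parent_id=root,
              model="gpt-4o", temperature=0.2):
        pass  # issue the model call here
```

Because every record carries the same trace ID, a single slow request can be reassembled into its full tree of LLM and tool calls at query time.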
Best Practices
- Redact PII before exporting traces.
- Keep a replayable request envelope for incident review.
- Alert on abnormal token spikes and tool error bursts.
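The redaction step can be sketched as pattern-based scrubbing applied to trace attributes before export. The two patterns here are illustrative; production systems need a fuller PII taxonomy (SSNs, API keys, addresses, and so on):

```python
import re

# Illustrative patterns only; extend for your data.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<phone>"),
]

def redact(text):
    """Replace PII matches before a trace attribute leaves the process."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Redacting in-process, before export, matters: once raw prompts reach a shared observability backend, access control there becomes your PII boundary.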
Related Skills
- alerting-oncall - Alert workflows
- agent-evals - Quality verification
Source
git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills.git
(skill file: devops/ai/agent-observability/SKILL.md)
Overview
Agent observability provides end-to-end visibility into AI agent behavior by tracing requests and tracking token usage, latency, and cost. This telemetry enables reliable operation, faster debugging, and informed incident response.
How This Skill Works
Add a trace ID to every user request and propagate it through to model inputs. Capture each LLM call and tool call as a child span, emitting structured logs with model, temperature, and response status. Define SLOs for success rate and median latency to drive reliability work.
When to Use It
- Diagnose slow or failing AI agent responses
- Understand token usage and cost per task or per customer
- Monitor tool call reliability, retries, and failures
- Detect hallucinations and abnormal latency spikes
- Perform post-incident reviews with replayable request envelopes
Quick Start
- Step 1: Add trace IDs to every incoming user request
- Step 2: Capture each LLM call and tool interaction as child spans and emit structured logs
- Step 3: Create SLOs for median latency and success rate, and build dashboards
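Step 3 can start as simply as comparing a rolling window of requests against targets. The 99% success and 800 ms median targets below are placeholders to be tuned per workload:

```python
def slo_report(outcomes, latencies_ms, success_target=0.99, p50_target_ms=800):
    """Compare a window of request outcomes/latencies against SLO targets.

    `outcomes` is a list of booleans (True = success); targets are examples.
    """
    success_rate = sum(outcomes) / len(outcomes)
    p50 = sorted(latencies_ms)[len(latencies_ms) // 2]
    return {
        "success_rate": success_rate,
        "success_ok": success_rate >= success_target,
        "p50_ms": p50,
        "latency_ok": p50 <= p50_target_ms,
    }
```

A dashboard then only needs to plot `success_rate` and `p50_ms` against their targets and page when either `*_ok` flag goes false for a sustained window.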
Best Practices
- Redact PII before exporting traces
- Keep a replayable request envelope for incident review
- Alert on abnormal token spikes and bursts of tool errors
- Instrument LLM and tool calls with structured logs and spans
- Define and monitor SLOs for success rate and median response time
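The token-spike alert can be sketched as a rolling-median baseline check; the 100-request window and 3x threshold are illustrative starting points:

```python
import statistics
from collections import deque

class TokenSpikeDetector:
    """Flag requests whose token usage far exceeds the recent baseline."""

    def __init__(self, window=100, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold  # multiple of the rolling median

    def observe(self, tokens):
        # Require a minimal baseline before alerting to avoid cold-start noise.
        spike = (len(self.history) >= 10
                 and tokens > self.threshold * statistics.median(self.history))
        self.history.append(tokens)
        return spike
```

A median baseline is deliberately robust to the occasional large request; the same structure works for tool-error bursts by feeding it error counts per interval.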
Example Use Cases
- An AI assistant tracks p95 latency and per-task token costs to optimize pricing and performance
- Incident review includes a replayable envelope showing request, model config, and outcomes
- Costs are surfaced per customer, helping teams identify expensive workflows
- Tool-call success/failure rates are monitored to reduce user-visible failures
- Redacted traces are exported to a centralized observability platform for audits
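Surfacing cost per customer reduces to grouping usage events by a customer attribute on each span. A sketch with a flat illustrative rate (real aggregation would be a query over exported trace data, and rates vary by model):

```python
from collections import defaultdict

def cost_by_customer(events, rate_per_1k=0.01):
    """Aggregate token cost per customer from usage events.

    `events` are dicts like {"customer": ..., "tokens": ...};
    the flat $/1K-token rate is an illustrative assumption.
    """
    totals = defaultdict(float)
    for e in events:
        totals[e["customer"]] += e["tokens"] * rate_per_1k / 1000.0
    return dict(totals)
```

Sorting the result by cost is usually enough to spot the expensive workflows called out above.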