agent-observability
npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/agent-observability --openclaw
Agent Observability
Monitor AI agent behavior with logs, traces, metrics, and cost telemetry.
Track Core Signals
- Request latency (p50/p95/p99)
- Token usage (prompt/completion/cached)
- Tool call success and failure rates
- Cost per task and per customer
- Hallucination and retry frequency
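Two of these signals can be computed directly from raw samples. A minimal sketch, assuming latency samples in milliseconds and illustrative per-1K-token rates (the rates are placeholders, not real model pricing):

```python
import statistics

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 from a window of request latencies in ms."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def cost_per_task(prompt_tokens, completion_tokens, cached_tokens,
                  prompt_rate=0.01, completion_rate=0.03, cached_rate=0.001):
    """Token cost for one task; rates are illustrative $ per 1K tokens."""
    return (prompt_tokens * prompt_rate
            + completion_tokens * completion_rate
            + cached_tokens * cached_rate) / 1000.0
```

Cached prompt tokens are billed separately because most providers discount them, which is why the signal list tracks prompt/completion/cached as distinct counters.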
Implementation Pattern
- Add trace IDs to every user request.
- Capture each LLM call and tool call as child spans.
- Emit structured logs with model, temperature, and response status.
- Create SLOs for success rate and median response time.
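The pattern above can be sketched without committing to a tracing library; a real deployment would typically use OpenTelemetry, but the shape of the data is the same. Span names, the `EXPORTED` buffer, and the model attributes below are illustrative:

```python
import json
import time
import uuid
from contextlib import contextmanager

EXPORTED = []  # stand-in for a real trace/log exporter

@contextmanager
def span(trace_id, name, parent_id=None, **attrs):
    """Record one unit of work (an LLM call, a tool call) as a child span."""
    span_id = uuid.uuid4().hex[:16]
    start = time.monotonic()
    status = "ok"
    try:
        yield span_id
    except Exception:
        status = "error"
        raise
    finally:
        # Structured log: trace linkage plus model/config attributes.
        EXPORTED.append(json.dumps({
            "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
            "name": name, "status": status,
            "duration_ms": round((time.monotonic() - start) * 1000, 2),
            **attrs,
        }))

# One trace ID per user request; every LLM/tool call becomes a child span.
trace_id = uuid.uuid4().hex
with span(trace_id, "handle_request") as root:
    with span(trace_id, "llm_call", parent_id=root,
              model="gpt-4o", temperature=0.2):
        pass  # issue the model call here
```

Because every record carries the same trace ID, a single slow request can be reassembled into its full tree of LLM and tool calls at query time.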
Best Practices
- Redact PII before exporting traces.
- Keep a replayable request envelope for incident review.
- Alert on abnormal token spikes and tool error bursts.
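The redaction step can be sketched as pattern-based scrubbing applied to trace attributes before export. The two patterns here are illustrative; production systems need a fuller PII taxonomy (SSNs, API keys, addresses, and so on):

```python
import re

# Illustrative patterns only; extend for your data.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<phone>"),
]

def redact(text):
    """Replace PII matches before a trace attribute leaves the process."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Redacting in-process, before export, matters: once raw prompts reach a shared observability backend, access control there becomes your PII boundary.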
Related Skills
- alerting-oncall - Alert workflows
- agent-evals - Quality verification
Source
git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills.git
(skill file: devops/ai/agent-observability/SKILL.md)
Overview
Agent observability provides end-to-end visibility into AI agent behavior by tracing requests and tracking token usage, latency, and cost. This telemetry enables reliable operation, faster debugging, and informed incident response.
How This Skill Works
Add a trace ID to every user request and propagate it through to model inputs. Capture each LLM call and tool call as a child span, emitting structured logs with model, temperature, and response status. Define SLOs for success rate and median latency to drive reliability work.
When to Use It
- Diagnose slow or failing AI agent responses
- Understand token usage and cost per task or per customer
- Monitor tool call reliability, retries, and failures
- Detect hallucinations and abnormal latency spikes
- Perform post-incident reviews with replayable request envelopes
Quick Start
- Step 1: Add trace IDs to every incoming user request
- Step 2: Capture each LLM call and tool interaction as child spans and emit structured logs
- Step 3: Create SLOs for median latency and success rate, and build dashboards
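Step 3 can start as simply as comparing a rolling window of requests against targets. The 99% success and 800 ms median targets below are placeholders to be tuned per workload:

```python
def slo_report(outcomes, latencies_ms, success_target=0.99, p50_target_ms=800):
    """Compare a window of request outcomes/latencies against SLO targets.

    `outcomes` is a list of booleans (True = success); targets are examples.
    """
    success_rate = sum(outcomes) / len(outcomes)
    p50 = sorted(latencies_ms)[len(latencies_ms) // 2]
    return {
        "success_rate": success_rate,
        "success_ok": success_rate >= success_target,
        "p50_ms": p50,
        "latency_ok": p50 <= p50_target_ms,
    }
```

A dashboard then only needs to plot `success_rate` and `p50_ms` against their targets and page when either `*_ok` flag goes false for a sustained window.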
Best Practices
- Redact PII before exporting traces
- Keep a replayable request envelope for incident review
- Alert on abnormal token spikes and bursts of tool errors
- Instrument LLM and tool calls with structured logs and spans
- Define and monitor SLOs for success rate and median response time
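The token-spike alert can be sketched as a rolling-median baseline check; the 100-request window and 3x threshold are illustrative starting points:

```python
import statistics
from collections import deque

class TokenSpikeDetector:
    """Flag requests whose token usage far exceeds the recent baseline."""

    def __init__(self, window=100, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold  # multiple of the rolling median

    def observe(self, tokens):
        # Require a minimal baseline before alerting to avoid cold-start noise.
        spike = (len(self.history) >= 10
                 and tokens > self.threshold * statistics.median(self.history))
        self.history.append(tokens)
        return spike
```

A median baseline is deliberately robust to the occasional large request; the same structure works for tool-error bursts by feeding it error counts per interval.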
Example Use Cases
- An AI assistant tracks p95 latency and per-task token costs to optimize pricing and performance
- Incident review includes a replayable envelope showing request, model config, and outcomes
- Costs are surfaced per customer, helping teams identify expensive workflows
- Tool-call success/failure rates are monitored to reduce user-visible failures
- Redacted traces are exported to a centralized observability platform for audits
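Surfacing cost per customer reduces to grouping usage events by a customer attribute on each span. A sketch with a flat illustrative rate (real aggregation would be a query over exported trace data, and rates vary by model):

```python
from collections import defaultdict

def cost_by_customer(events, rate_per_1k=0.01):
    """Aggregate token cost per customer from usage events.

    `events` are dicts like {"customer": ..., "tokens": ...};
    the flat $/1K-token rate is an illustrative assumption.
    """
    totals = defaultdict(float)
    for e in events:
        totals[e["customer"]] += e["tokens"] * rate_per_1k / 1000.0
    return dict(totals)
```

Sorting the result by cost is usually enough to spot the expensive workflows called out above.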