AI Observability

npx machina-cli add skill omer-metin/skills-for-antigravity/ai-observability --openclaw
Identity
Principles
- Trace Every LLM Call: Production AI apps without tracing are flying blind. Every LLM call should be traced with inputs, outputs, latency, tokens, and cost. Use structured spans for multi-step chains and agents.
- Measure What Matters: Track metrics that correlate with user value: faithfulness for RAG, answer relevancy, latency percentiles, cost per successful outcome. Vanity metrics (total calls) don't improve product quality.
- Cost Is a First-Class Metric: Token costs can explode overnight with agent loops or context growth. Track cost per user, per feature, and per model. Set budgets and alerts. Prompt caching can cut costs by 50-90%.
- Evaluate Continuously: Run automated evals on production samples. RAGAS metrics (faithfulness, relevancy, context precision) catch quality degradation before users complain. A score above 0.8 is generally good.
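The "Cost Is a First-Class Metric" principle boils down to simple per-call arithmetic. Here is a minimal sketch; the price table and function names are illustrative placeholders, not real vendor pricing or a real SDK.

```python
# Minimal sketch of per-call cost tracking. Model names and prices are
# illustrative placeholders, NOT real vendor pricing.
ILLUSTRATIVE_PRICES = {  # USD per 1M tokens: (input, output)
    "small-model": (0.15, 0.60),
    "large-model": (3.00, 15.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single LLM call."""
    in_price, out_price = ILLUSTRATIVE_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def add_usage(totals: dict, feature: str, cost: float) -> None:
    """Aggregate cost per feature so budgets and alerts can target it."""
    totals[feature] = totals.get(feature, 0.0) + cost
```

Aggregating by feature (or user, or model) is what makes budget alerts actionable: a spike shows up where it originates, not just in the monthly total.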
Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
- For Creation: always consult references/patterns.md. This file dictates how things should be built; ignore generic approaches if a specific pattern exists there.
- For Diagnosis: always consult references/sharp_edges.md. This file lists the critical failures and why they happen; use it to explain risks to the user.
- For Review: always consult references/validations.md. This contains the strict rules and constraints; use it to validate user inputs objectively.
Note: If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
Source
https://github.com/omer-metin/skills-for-antigravity/blob/main/skills/ai-observability/SKILL.md

Overview
AI Observability provides end-to-end visibility for LLM applications, tying together tracing (Langfuse/Helicone), cost tracking, token optimization, RAG evaluation metrics (RAGAS), hallucination detection, and production monitoring. It's essential for debugging, optimizing costs, and safeguarding output quality in live AI systems.
How This Skill Works
Instrument every LLM call with structured spans to capture inputs, outputs, latency, tokens, and cost. Aggregate metrics by user, feature, and model; compute RAGAS scores for faithfulness and relevancy; run hallucination checks; and surface alerts through your production monitoring stack for rapid action.
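The structured-span idea described above can be sketched without any SDK. This is a library-free illustration only; a real deployment would use the Langfuse or Helicone client instead of an in-memory list, and all names here are made up for the example.

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in sink; a real system exports spans to Langfuse/Helicone

@contextmanager
def llm_span(name: str, model: str, prompt: str):
    """Record one LLM call as a structured span capturing inputs,
    outputs, latency, and token counts (cost is derivable from tokens)."""
    span = {"name": name, "model": model, "input": prompt}
    start = time.perf_counter()
    try:
        yield span  # caller fills in span["output"] and span["tokens"]
    finally:
        span["latency_s"] = time.perf_counter() - start
        SPANS.append(span)

# Usage: wrap each step of a multi-step chain in its own span.
with llm_span("summarize", "example-model", "Summarize this doc") as s:
    s["output"] = "A short summary."   # stand-in for a real model response
    s["tokens"] = {"input": 12, "output": 5}
```

Because the span closes in a `finally` block, latency is recorded even when the model call raises, which is exactly when you most want the trace.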
When to Use It
- You need end-to-end tracing for multi-step LLM pipelines
- You want to monitor and control token usage and costs
- You rely on RAG for data retrieval and need RAGAS evaluation
- You must detect hallucinations and monitor output quality in production
- You plan to implement prompt caching to reduce context length and costs
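Once calls are traced, latency percentiles (mentioned under "Measure What Matters") fall out of the recorded spans. A minimal nearest-rank percentile over collected latencies, purely as a sketch:

```python
import math

def percentile(latencies: list[float], p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) over recorded latencies.
    Suitable for small batch reports; streaming systems would use a
    sketch structure (e.g. t-digest) instead."""
    ordered = sorted(latencies)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[k]

# Example: p95 over latencies (seconds) pulled from traced spans.
lat = [0.2, 0.3, 0.25, 1.8, 0.4]
p95 = percentile(lat, 95)
```

Percentiles matter because a healthy average can hide a long tail: one slow agent loop dominates p95/p99 long before it moves the mean.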
Quick Start
- Step 1: Enable tracing for all LLM calls using Langfuse or Helicone and define structured spans
- Step 2: Instrument inputs, outputs, latency, tokens, and cost; set up per-model and per-feature cost tracking
- Step 3: Configure RAGAS metrics, enable hallucination detection, and connect dashboards and prompt caching
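For Step 3, real RAGAS faithfulness uses an LLM judge; the crude word-overlap heuristic below only illustrates the shape of the metric (a 0-1 score checked against the ~0.8 threshold) and is not the RAGAS algorithm.

```python
def faithfulness_score(answer: str, contexts: list[str]) -> float:
    """Crude stand-in for a RAGAS-style faithfulness check: fraction of
    answer words that also appear in the retrieved contexts. Real RAGAS
    decomposes the answer into claims and judges each with an LLM."""
    answer_words = set(answer.lower().split())
    context_words = set(" ".join(contexts).lower().split())
    if not answer_words:
        return 1.0
    return len(answer_words & context_words) / len(answer_words)

def alert_if_drifting(scores: list[float], threshold: float = 0.8) -> bool:
    """Flag when the rolling average drops below the 'generally good' 0.8."""
    return sum(scores) / len(scores) < threshold
```

Running such a check on sampled production traffic, rather than only on a static eval set, is what catches faithfulness drift before users report it.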
Best Practices
- Trace every LLM call with inputs, outputs, latency, tokens, and cost using structured spans
- Track cost per user, per feature, and per model; set budgets and alerts; leverage prompt caching to cut costs
- Measure RAGAS metrics (faithfulness, relevancy, context precision) and monitor trends over time
- Enable hallucination detection and establish risk-based alerting
- Integrate with Langfuse/Helicone for tracing and implement production monitoring dashboards
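The prompt-caching practice above can be sketched as simple response caching keyed on (model, prompt). Note the hedge: provider-side prompt caching (reusing a cached context prefix on the vendor's servers) works differently; this only illustrates the general goal of not re-paying for repeated context. All names are illustrative.

```python
import hashlib

_cache: dict[str, str] = {}  # in production, a shared store with TTLs

def cache_key(model: str, prompt: str) -> str:
    """Stable key over the exact model + prompt pair."""
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_call(model: str, prompt: str, call_fn) -> tuple[str, bool]:
    """Return (response, cache_hit). call_fn stands in for a real LLM call
    and is only invoked on a miss."""
    key = cache_key(model, prompt)
    if key in _cache:
        return _cache[key], True
    response = call_fn(model, prompt)
    _cache[key] = response
    return response, False
```

Tracking the hit rate alongside cost per feature shows how much of the claimed 50-90% saving your workload actually realizes.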
Example Use Cases
- Tracing a multi-step chat workflow to locate latency and token bloat
- Cost dashboards showing per-feature token usage and budget adherence
- RAGAS-based evaluation on live data to detect faithfulness drift
- Hallucination alerts triggered by risky outputs in customer support bots
- Prompt caching implemented to reduce repeated context and lower costs