AI Observability

npx machina-cli add skill omer-metin/skills-for-antigravity/ai-observability --openclaw
Identity
Principles
- Trace Every LLM Call: Production AI apps without tracing are flying blind. Every LLM call should be traced with inputs, outputs, latency, tokens, and cost. Use structured spans for multi-step chains and agents.
- Measure What Matters: Track metrics that correlate with user value: faithfulness for RAG, answer relevancy, latency percentiles, cost per successful outcome. Vanity metrics (total calls) don't improve product quality.
- Cost Is a First-Class Metric: Token costs can explode overnight with agent loops or context growth. Track cost per user, per feature, and per model. Set budgets and alerts. Prompt caching can cut costs by 50-90%.
- Evaluate Continuously: Run automated evals on production samples. RAGAS metrics (faithfulness, relevancy, context precision) catch quality degradation before users complain. A score above 0.8 is generally good.
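The "Cost Is a First-Class Metric" principle boils down to simple per-call arithmetic. Here is a minimal sketch; the price table and function names are illustrative placeholders, not real vendor pricing or a real SDK.

```python
# Minimal sketch of per-call cost tracking. Model names and prices are
# illustrative placeholders, NOT real vendor pricing.
ILLUSTRATIVE_PRICES = {  # USD per 1M tokens: (input, output)
    "small-model": (0.15, 0.60),
    "large-model": (3.00, 15.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single LLM call."""
    in_price, out_price = ILLUSTRATIVE_PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

def add_usage(totals: dict, feature: str, cost: float) -> None:
    """Aggregate cost per feature so budgets and alerts can target it."""
    totals[feature] = totals.get(feature, 0.0) + cost
```

Aggregating by feature (or user, or model) is what makes budget alerts actionable: a spike shows up where it originates, not just in the monthly total.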
Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
- For Creation: always consult references/patterns.md. This file dictates how things should be built; ignore generic approaches if a specific pattern exists there.
- For Diagnosis: always consult references/sharp_edges.md. This file lists the critical failures and why they happen; use it to explain risks to the user.
- For Review: always consult references/validations.md. This contains the strict rules and constraints; use it to validate user inputs objectively.
Note: If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
Source
https://github.com/omer-metin/skills-for-antigravity/blob/main/skills/ai-observability/SKILL.md

Overview
AI Observability provides end-to-end visibility for LLM applications, tying together tracing (Langfuse/Helicone), cost tracking, token optimization, RAG evaluation metrics (RAGAS), hallucination detection, and production monitoring. It's essential for debugging, optimizing costs, and safeguarding output quality in live AI systems.
How This Skill Works
Instrument every LLM call with structured spans to capture inputs, outputs, latency, tokens, and cost. Aggregate metrics by user, feature, and model; compute RAGAS scores for faithfulness and relevancy; run hallucination checks; and surface alerts through your production monitoring stack for rapid action.
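The structured-span idea described above can be sketched without any SDK. This is a library-free illustration only; a real deployment would use the Langfuse or Helicone client instead of an in-memory list, and all names here are made up for the example.

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in sink; a real system exports spans to Langfuse/Helicone

@contextmanager
def llm_span(name: str, model: str, prompt: str):
    """Record one LLM call as a structured span capturing inputs,
    outputs, latency, and token counts (cost is derivable from tokens)."""
    span = {"name": name, "model": model, "input": prompt}
    start = time.perf_counter()
    try:
        yield span  # caller fills in span["output"] and span["tokens"]
    finally:
        span["latency_s"] = time.perf_counter() - start
        SPANS.append(span)

# Usage: wrap each step of a multi-step chain in its own span.
with llm_span("summarize", "example-model", "Summarize this doc") as s:
    s["output"] = "A short summary."   # stand-in for a real model response
    s["tokens"] = {"input": 12, "output": 5}
```

Because the span closes in a `finally` block, latency is recorded even when the model call raises, which is exactly when you most want the trace.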
When to Use It
- You need end-to-end tracing for multi-step LLM pipelines
- You want to monitor and control token usage and costs
- You rely on RAG for data retrieval and need RAGAS evaluation
- You must detect hallucinations and monitor output quality in production
- You plan to implement prompt caching to reduce context length and costs
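Once calls are traced, latency percentiles (mentioned under "Measure What Matters") fall out of the recorded spans. A minimal nearest-rank percentile over collected latencies, purely as a sketch:

```python
import math

def percentile(latencies: list[float], p: float) -> float:
    """Nearest-rank percentile (p in (0, 100]) over recorded latencies.
    Suitable for small batch reports; streaming systems would use a
    sketch structure (e.g. t-digest) instead."""
    ordered = sorted(latencies)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[k]

# Example: p95 over latencies (seconds) pulled from traced spans.
lat = [0.2, 0.3, 0.25, 1.8, 0.4]
p95 = percentile(lat, 95)
```

Percentiles matter because a healthy average can hide a long tail: one slow agent loop dominates p95/p99 long before it moves the mean.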
Quick Start
- Step 1: Enable tracing for all LLM calls using Langfuse or Helicone and define structured spans
- Step 2: Instrument inputs, outputs, latency, tokens, and cost; set up per-model and per-feature cost tracking
- Step 3: Configure RAGAS metrics, enable hallucination detection, and connect dashboards and prompt caching
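For Step 3, real RAGAS faithfulness uses an LLM judge; the crude word-overlap heuristic below only illustrates the shape of the metric (a 0-1 score checked against the ~0.8 threshold) and is not the RAGAS algorithm.

```python
def faithfulness_score(answer: str, contexts: list[str]) -> float:
    """Crude stand-in for a RAGAS-style faithfulness check: fraction of
    answer words that also appear in the retrieved contexts. Real RAGAS
    decomposes the answer into claims and judges each with an LLM."""
    answer_words = set(answer.lower().split())
    context_words = set(" ".join(contexts).lower().split())
    if not answer_words:
        return 1.0
    return len(answer_words & context_words) / len(answer_words)

def alert_if_drifting(scores: list[float], threshold: float = 0.8) -> bool:
    """Flag when the rolling average drops below the 'generally good' 0.8."""
    return sum(scores) / len(scores) < threshold
```

Running such a check on sampled production traffic, rather than only on a static eval set, is what catches faithfulness drift before users report it.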
Best Practices
- Trace every LLM call with inputs, outputs, latency, tokens, and cost using structured spans
- Track cost per user, per feature, and per model; set budgets and alerts; leverage prompt caching to cut costs
- Measure RAGAS metrics (faithfulness, relevancy, context precision) and monitor trends over time
- Enable hallucination detection and establish risk-based alerting
- Integrate with Langfuse/Helicone for tracing and implement production monitoring dashboards
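The prompt-caching practice above can be sketched as simple response caching keyed on (model, prompt). Note the hedge: provider-side prompt caching (reusing a cached context prefix on the vendor's servers) works differently; this only illustrates the general goal of not re-paying for repeated context. All names are illustrative.

```python
import hashlib

_cache: dict[str, str] = {}  # in production, a shared store with TTLs

def cache_key(model: str, prompt: str) -> str:
    """Stable key over the exact model + prompt pair."""
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

def cached_call(model: str, prompt: str, call_fn) -> tuple[str, bool]:
    """Return (response, cache_hit). call_fn stands in for a real LLM call
    and is only invoked on a miss."""
    key = cache_key(model, prompt)
    if key in _cache:
        return _cache[key], True
    response = call_fn(model, prompt)
    _cache[key] = response
    return response, False
```

Tracking the hit rate alongside cost per feature shows how much of the claimed 50-90% saving your workload actually realizes.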
Example Use Cases
- Tracing a multi-step chat workflow to locate latency and token bloat
- Cost dashboards showing per-feature token usage and budget adherence
- RAGAS-based evaluation on live data to detect faithfulness drift
- Hallucination alerts triggered by risky outputs in customer support bots
- Prompt caching implemented to reduce repeated context and lower costs