phoenix-arize-setup

npx machina-cli add skill a5c-ai/babysitter/phoenix-arize-setup --openclaw
Files (1)
SKILL.md
1.2 KB

Phoenix Arize Setup Skill

Capabilities

  • Set up Phoenix local server
  • Configure tracing instrumentation
  • Design evaluation experiments
  • Implement embedding visualizations
  • Set up retrieval analysis
  • Create custom evaluations with LLM-as-judge

Target Processes

  • llm-observability-monitoring
  • agent-evaluation-framework

Implementation Details

Core Features

  1. Tracing: OpenTelemetry-based LLM traces
  2. Evals: LLM-as-judge evaluations
  3. Embeddings: Visualization and drift detection
  4. Retrieval: RAG quality analysis
  5. Datasets: Experiment management

Instrumentation

  • OpenAI auto-instrumentation
  • LangChain instrumentation
  • LlamaIndex instrumentation
  • Custom span creation
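
The instrumentors above can be wired up in a few lines. This is a sketch assuming `arize-phoenix` and the corresponding `openinference-instrumentation-*` packages are installed; verify the class names against your installed versions:

```python
# Sketch: route OpenInference auto-instrumentation through Phoenix's
# OTel tracer provider. Assumes arize-phoenix plus the relevant
# openinference-instrumentation-* packages are installed.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Point traces at a locally running Phoenix collector (default port 6006).
tracer_provider = register(project_name="my-llm-app")

# Each instrumentor patches its library so every call emits spans.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```

Only the instrumentors for libraries you actually use need to be installed and enabled; each one is independent.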

Configuration Options

  • Phoenix server setup
  • Trace sampling
  • Evaluation metrics
  • Embedding models
  • Export settings
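
For server setup, Phoenix reads `PHOENIX_*` environment variables at startup. The variable names below are the commonly documented ones, but check them against your installed version; this is a configuration sketch, not a complete reference:

```shell
# Host and port for the Phoenix UI and OTLP collector (defaults shown).
export PHOENIX_HOST=0.0.0.0
export PHOENIX_PORT=6006
# Where instrumented apps should send their traces.
export PHOENIX_COLLECTOR_ENDPOINT="http://localhost:6006"
# Optional: persist traces across restarts instead of keeping them in memory.
export PHOENIX_WORKING_DIR="$HOME/.phoenix"
phoenix serve
```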

Best Practices

  • Comprehensive instrumentation
  • Regular evaluation runs
  • Monitor embedding drift
  • Analyze retrieval quality

Dependencies

  • arize-phoenix
  • openinference-instrumentation-openai

Source

https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/ai-agents-conversational/skills/phoenix-arize-setup/SKILL.md

Overview

This skill configures a local Phoenix server with OpenTelemetry tracing, embedding visualizations, and evaluation workflows for LLM debugging and evaluation. It enables end-to-end observability from traces to embeddings and retrieval analysis, including LLM-as-judge evaluations.

How This Skill Works

Install and start a Phoenix server, then enable OpenTelemetry instrumentation across OpenAI, LangChain, LlamaIndex, and custom spans. Run experiments that collect traces, embeddings, and retrieval data, and use the Phoenix UI to analyze performance, drift, and RAG quality.
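
The flow above can be sketched end to end. This assumes `arize-phoenix` and `openinference-instrumentation-openai` are installed; the project, span, and attribute names are illustrative, not required by Phoenix:

```python
# Sketch: launch Phoenix in-process, auto-instrument OpenAI calls, and
# wrap a retrieval step in a custom span. Span and attribute names like
# "retrieve-documents" are illustrative choices, not Phoenix conventions.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

session = px.launch_app()            # serves the Phoenix UI locally
tracer_provider = register(project_name="rag-demo")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

tracer = tracer_provider.get_tracer(__name__)
with tracer.start_as_current_span("retrieve-documents") as span:
    span.set_attribute("retrieval.query", "What does Phoenix trace?")
    # ... run your retriever and LLM calls here; their spans nest under this one.

print(session.url)                   # open this URL to inspect collected traces
```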

When to Use It

  • Debug LLM runs with a Phoenix-based observability stack
  • Design and run LLM-as-judge evaluation experiments
  • Analyze embedding drift and visualize embeddings
  • Perform retrieval quality and RAG analysis
  • Manage experiments with datasets and export settings

Quick Start

  1. Install and start the Phoenix server with arize-phoenix
  2. Enable OpenTelemetry instrumentation for OpenAI, LangChain, and LlamaIndex
  3. Configure evaluation metrics, embedding models, and export settings; then run an experiment

Best Practices

  • Comprehensive instrumentation across tracing, embeddings, and retrieval
  • Regular evaluation runs to benchmark models
  • Monitor embedding drift to detect semantic changes
  • Analyze retrieval quality to surface weak spots in RAG pipelines
  • Tightly configure export and evaluation metrics for reproducibility
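
To make the drift-monitoring practice concrete, one minimal approach is to compare the centroid of a baseline embedding batch against a recent batch via cosine distance. This is a hand-rolled, stdlib-only sketch, not Phoenix's built-in drift metric:

```python
# Minimal embedding-drift check: cosine distance between batch centroids.
# A distance near 0 means the recent batch looks like the baseline;
# larger values suggest a semantic shift worth investigating in Phoenix.
import math

def centroid(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(dims) / n for dims in zip(*embeddings)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def embedding_drift(baseline, current):
    """Cosine distance between the centroids of two embedding batches."""
    return cosine_distance(centroid(baseline), centroid(current))

baseline = [[1.0, 0.0], [0.9, 0.1]]
shifted = [[0.0, 1.0], [0.1, 0.9]]
print(embedding_drift(baseline, baseline))  # near zero: no drift
print(embedding_drift(baseline, shifted))   # large: embeddings have shifted
```

In production you would compute this over embeddings exported from Phoenix on a schedule and alert when the distance crosses a threshold you calibrate on historical batches.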

Example Use Cases

  • Set up a local Phoenix server with OpenTelemetry to trace an LLM call
  • Run LLM-as-judge evaluations to compare model outputs
  • Visualize embeddings and detect drift across model versions
  • Perform retrieval quality analysis to assess RAG performance
  • Manage experiments with datasets and export results for reporting
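
As one concrete sketch of the LLM-as-judge use case, an evaluation with `phoenix.evals` might look like the following. The judge prompt wording, rails, and model name are assumptions for illustration, not part of the skill:

```python
# Sketch: an LLM-as-judge correctness check. The rails constrain the
# judge's answer to a fixed label set so results are easy to aggregate.
# Prompt wording, rails, and the judge model are illustrative choices.
RAILS = ["correct", "incorrect"]

JUDGE_TEMPLATE = """You are grading an answer to a question.
Question: {input}
Answer: {output}
Respond with exactly one word: correct or incorrect."""

def run_judge(dataframe):
    """Run the judge over a DataFrame with 'input' and 'output' columns."""
    # Imported here so the template above is reusable without phoenix installed.
    from phoenix.evals import OpenAIModel, llm_classify
    return llm_classify(
        dataframe=dataframe,
        model=OpenAIModel(model="gpt-4o-mini"),  # assumed judge model
        template=JUDGE_TEMPLATE,
        rails=RAILS,
    )
```

The returned DataFrame of labels can be logged back to Phoenix alongside the traces it was computed from, which is what makes regular evaluation runs comparable over time.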
