phoenix-arize-setup

npx machina-cli add skill a5c-ai/babysitter/phoenix-arize-setup --openclaw
Files (1)
SKILL.md
1.2 KB

Phoenix Arize Setup Skill

Capabilities

  • Set up Phoenix local server
  • Configure tracing instrumentation
  • Design evaluation experiments
  • Implement embedding visualizations
  • Set up retrieval analysis
  • Create custom evaluations with LLM-as-judge

Target Processes

  • llm-observability-monitoring
  • agent-evaluation-framework

Implementation Details

Core Features

  1. Tracing: OpenTelemetry-based LLM traces
  2. Evals: LLM-as-judge evaluations
  3. Embeddings: Visualization and drift detection
  4. Retrieval: RAG quality analysis
  5. Datasets: Experiment management

Instrumentation

  • OpenAI auto-instrumentation
  • LangChain instrumentation
  • LlamaIndex instrumentation
  • Custom span creation
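
The instrumentors above can be wired up in a few lines. This is a sketch assuming `arize-phoenix` and the corresponding `openinference-instrumentation-*` packages are installed; verify the class names against your installed versions:

```python
# Sketch: route OpenInference auto-instrumentation through Phoenix's
# OTel tracer provider. Assumes arize-phoenix plus the relevant
# openinference-instrumentation-* packages are installed.
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Point traces at a locally running Phoenix collector (default port 6006).
tracer_provider = register(project_name="my-llm-app")

# Each instrumentor patches its library so every call emits spans.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```

Only the instrumentors for libraries you actually use need to be installed and enabled; each one is independent.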

Configuration Options

  • Phoenix server setup
  • Trace sampling
  • Evaluation metrics
  • Embedding models
  • Export settings
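
For server setup, Phoenix reads `PHOENIX_*` environment variables at startup. The variable names below are the commonly documented ones, but check them against your installed version; this is a configuration sketch, not a complete reference:

```shell
# Host and port for the Phoenix UI and OTLP collector (defaults shown).
export PHOENIX_HOST=0.0.0.0
export PHOENIX_PORT=6006
# Where instrumented apps should send their traces.
export PHOENIX_COLLECTOR_ENDPOINT="http://localhost:6006"
# Optional: persist traces across restarts instead of keeping them in memory.
export PHOENIX_WORKING_DIR="$HOME/.phoenix"
phoenix serve
```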

Best Practices

  • Comprehensive instrumentation
  • Regular evaluation runs
  • Monitor embedding drift
  • Analyze retrieval quality

Dependencies

  • arize-phoenix
  • openinference-instrumentation-openai

Source

https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/ai-agents-conversational/skills/phoenix-arize-setup/SKILL.md

Overview

This skill configures a local Phoenix server with OpenTelemetry tracing, embedding visualizations, and evaluation workflows for LLM debugging and evaluation. It enables end-to-end observability from traces to embeddings and retrieval analysis, including LLM-as-judge evaluations.

How This Skill Works

Install and start a Phoenix server, then enable OpenTelemetry instrumentation across OpenAI, LangChain, LlamaIndex, and custom spans. Run experiments that collect traces, embeddings, and retrieval data, and use the Phoenix UI to analyze performance, drift, and RAG quality.
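
The flow above can be sketched end to end. This assumes `arize-phoenix` and `openinference-instrumentation-openai` are installed; the project, span, and attribute names are illustrative, not required by Phoenix:

```python
# Sketch: launch Phoenix in-process, auto-instrument OpenAI calls, and
# wrap a retrieval step in a custom span. Span and attribute names like
# "retrieve-documents" are illustrative choices, not Phoenix conventions.
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

session = px.launch_app()            # serves the Phoenix UI locally
tracer_provider = register(project_name="rag-demo")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

tracer = tracer_provider.get_tracer(__name__)
with tracer.start_as_current_span("retrieve-documents") as span:
    span.set_attribute("retrieval.query", "What does Phoenix trace?")
    # ... run your retriever and LLM calls here; their spans nest under this one.

print(session.url)                   # open this URL to inspect collected traces
```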

When to Use It

  • Debug LLM runs with a Phoenix-based observability stack
  • Design and run LLM-as-judge evaluation experiments
  • Analyze embedding drift and visualize embeddings
  • Perform retrieval quality and RAG analysis
  • Manage experiments with datasets and export settings

Quick Start

  1. Install and start the Phoenix server with arize-phoenix
  2. Enable OpenTelemetry instrumentation for OpenAI, LangChain, and LlamaIndex
  3. Configure evaluation metrics, embedding models, and export settings; then run an experiment

Best Practices

  • Comprehensive instrumentation across tracing, embeddings, and retrieval
  • Regular evaluation runs to benchmark models
  • Monitor embedding drift to detect semantic changes
  • Analyze retrieval quality to surface weak spots in RAG pipelines
  • Tightly configure export and evaluation metrics for reproducibility
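
To make the drift-monitoring practice concrete, one minimal approach is to compare the centroid of a baseline embedding batch against a recent batch via cosine distance. This is a hand-rolled, stdlib-only sketch, not Phoenix's built-in drift metric:

```python
# Minimal embedding-drift check: cosine distance between batch centroids.
# A distance near 0 means the recent batch looks like the baseline;
# larger values suggest a semantic shift worth investigating in Phoenix.
import math

def centroid(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(dims) / n for dims in zip(*embeddings)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def embedding_drift(baseline, current):
    """Cosine distance between the centroids of two embedding batches."""
    return cosine_distance(centroid(baseline), centroid(current))

baseline = [[1.0, 0.0], [0.9, 0.1]]
shifted = [[0.0, 1.0], [0.1, 0.9]]
print(embedding_drift(baseline, baseline))  # near zero: no drift
print(embedding_drift(baseline, shifted))   # large: embeddings have shifted
```

In production you would compute this over embeddings exported from Phoenix on a schedule and alert when the distance crosses a threshold you calibrate on historical batches.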

Example Use Cases

  • Set up a local Phoenix server with OpenTelemetry to trace an LLM call
  • Run LLM-as-judge evaluations to compare model outputs
  • Visualize embeddings and detect drift across model versions
  • Perform retrieval quality analysis to assess RAG performance
  • Manage experiments with datasets and export results for reporting
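
As one concrete sketch of the LLM-as-judge use case, an evaluation with `phoenix.evals` might look like the following. The judge prompt wording, rails, and model name are assumptions for illustration, not part of the skill:

```python
# Sketch: an LLM-as-judge correctness check. The rails constrain the
# judge's answer to a fixed label set so results are easy to aggregate.
# Prompt wording, rails, and the judge model are illustrative choices.
RAILS = ["correct", "incorrect"]

JUDGE_TEMPLATE = """You are grading an answer to a question.
Question: {input}
Answer: {output}
Respond with exactly one word: correct or incorrect."""

def run_judge(dataframe):
    """Run the judge over a DataFrame with 'input' and 'output' columns."""
    # Imported here so the template above is reusable without phoenix installed.
    from phoenix.evals import OpenAIModel, llm_classify
    return llm_classify(
        dataframe=dataframe,
        model=OpenAIModel(model="gpt-4o-mini"),  # assumed judge model
        template=JUDGE_TEMPLATE,
        rails=RAILS,
    )
```

The returned DataFrame of labels can be logged back to Phoenix alongside the traces it was computed from, which is what makes regular evaluation runs comparable over time.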
