# Phoenix - AI Observability Platform
Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.
## When to use Phoenix

Use Phoenix when:
- Debugging LLM application issues with detailed traces
- Running systematic evaluations on datasets
- Monitoring production LLM systems in real-time
- Building experiment pipelines for prompt/model comparison
- Self-hosting observability without vendor lock-in
Key features:
- Tracing: OpenTelemetry-based trace collection for any LLM framework
- Evaluation: LLM-as-judge evaluators for quality assessment
- Datasets: Versioned test sets for regression testing
- Experiments: Compare prompts, models, and configurations
- Playground: Interactive prompt testing with multiple models
- Open-source: Self-hosted with PostgreSQL or SQLite
Consider alternatives instead:
- LangSmith: Managed platform with LangChain-first integration
- Weights & Biases: Deep learning experiment tracking focus
- Arize Cloud: Managed Phoenix with enterprise features
- MLflow: General ML lifecycle, model registry focus
## Quick start

### Installation

```bash
pip install arize-phoenix

# With specific backends
pip install 'arize-phoenix[embeddings]'  # Embedding analysis
pip install arize-phoenix-otel           # OpenTelemetry config
pip install arize-phoenix-evals          # Evaluation framework
pip install arize-phoenix-client         # Lightweight REST client
```
### Launch Phoenix server

```python
import phoenix as px

# Launch in notebook (ThreadServer mode)
session = px.launch_app()

# View UI
session.view()      # Embedded iframe
print(session.url)  # http://localhost:6006
```
### Command-line server (production)

```bash
# Start Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
```
### Basic tracing

```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Instrument the OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
## Core concepts

### Traces and spans

A trace represents a complete execution flow, while spans are the individual operations within that trace.
```python
from opentelemetry import trace
from phoenix.otel import register

# Set up tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
```
### Projects

Projects organize related traces:

```python
import os

os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or per tracer provider at registration
from phoenix.otel import register

tracer_provider = register(project_name="experiment-v2")
```
## Framework instrumentation

### OpenAI

```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

### LangChain

```python
from openinference.instrumentation.langchain import LangChainInstrumentor
from phoenix.otel import register

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations are now traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
```

### LlamaIndex

```python
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```

### Anthropic

```python
from openinference.instrumentation.anthropic import AnthropicInstrumentor
from phoenix.otel import register

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
```
## Evaluation framework

### Built-in evaluators

```python
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify,
)

# Set up the model used as the judge
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
```
### Custom evaluators

```python
from phoenix.evals import llm_classify

# Define a custom evaluation
def evaluate_helpfulness(input_text, output_text):
    template = """
    Evaluate if the response is helpful for the given question.
    Question: {input}
    Response: {output}
    Is this response helpful? Answer 'helpful' or 'not_helpful'.
    """
    result = llm_classify(
        model=eval_model,
        template=template,
        input=input_text,
        output=output_text,
        rails=["helpful", "not_helpful"],
    )
    return result
```
### Run evaluations on a dataset

```python
from phoenix import Client
from phoenix.evals import run_evals

client = Client()

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model),
    ],
    provide_explanation=True,
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
```
## Datasets and experiments

### Create a dataset

```python
from phoenix import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset",
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"},
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"},
        },
    ],
)
```
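Each example must carry both `input` and `output` keys. A small pre-upload check can catch malformed examples before they reach the server; the `validate_examples` helper below is our own sketch, not part of the Phoenix client:

```python
# Hypothetical pre-upload check for dataset examples (not shipped by Phoenix).
def validate_examples(examples):
    errors = []
    for i, ex in enumerate(examples):
        for key in ("input", "output"):
            if key not in ex:
                errors.append(f"example {i} missing '{key}'")
    return errors

examples = [
    {"input": {"question": "What is Python?"}, "output": {"answer": "A programming language"}},
    {"input": {"question": "What is ML?"}},  # missing "output"
]
print(validate_examples(examples))  # ["example 1 missing 'output'"]
```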
### Run an experiment

```python
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()

def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}

def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
    }

# Run the experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1",
)
print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")
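The scoring logic inside `accuracy_evaluator` is plain Python, so it can be sanity-checked without a running Phoenix server; the sample data below is illustrative:

```python
# Standalone check of the accuracy_evaluator logic (no Phoenix required).
def accuracy_evaluator(input_data, output, expected):
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
    }

result = accuracy_evaluator(
    {"question": "What is Python?"},
    {"answer": "Python is a programming language."},
    {"answer": "A programming language"},
)
print(result)  # {'score': 1.0, 'label': 'correct'}
```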
## Client API

### Query traces and spans

```python
from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as a DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000,
)

# Get a specific span
span = client.get_span(span_id="abc123")

# Get a trace
trace = client.get_trace(trace_id="xyz789")
```
### Log feedback

```python
from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"},
)
```
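A common pattern is converting binary UI feedback (thumbs up/down) into the `score`/`label` pair passed to `log_annotation`. The mapping below is an illustrative convention of ours, not a Phoenix API:

```python
# Illustrative mapping from UI feedback to annotation fields (our convention,
# not part of the Phoenix client).
def feedback_to_annotation(thumbs_up: bool) -> dict:
    return {
        "score": 1.0 if thumbs_up else 0.0,
        "label": "helpful" if thumbs_up else "not_helpful",
    }

print(feedback_to_annotation(True))   # {'score': 1.0, 'label': 'helpful'}
print(feedback_to_annotation(False))  # {'score': 0.0, 'label': 'not_helpful'}
```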
### Export data

```python
# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")
```
## Production deployment

### Docker

```bash
docker run -p 6006:6006 arizephoenix/phoenix:latest
```

### With PostgreSQL

```bash
# Set the database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start the server
phoenix serve --host 0.0.0.0 --port 6006
```
### Environment variables

| Variable | Description | Default |
|---|---|---|
| `PHOENIX_PORT` | HTTP server port | `6006` |
| `PHOENIX_HOST` | Server bind address | `127.0.0.1` |
| `PHOENIX_GRPC_PORT` | gRPC/OTLP port | `4317` |
| `PHOENIX_SQL_DATABASE_URL` | Database connection | Temporary SQLite |
| `PHOENIX_WORKING_DIR` | Data storage directory | OS temp directory |
| `PHOENIX_ENABLE_AUTH` | Enable authentication | `false` |
| `PHOENIX_SECRET` | JWT signing secret | Required if auth is enabled |
### With authentication

```bash
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"
phoenix serve
```
## Best practices
- Use projects: Separate traces by environment (dev/staging/prod)
- Add metadata: Include user IDs, session IDs for debugging
- Evaluate regularly: Run automated evaluations in CI/CD
- Version datasets: Track test set changes over time
- Monitor costs: Track token usage via Phoenix dashboards
- Self-host: Use PostgreSQL for production deployments
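The project-per-environment practice can be wired through the `PHOENIX_PROJECT_NAME` environment variable shown earlier. The naming helper and the `APP_ENV` variable below are assumptions for illustration, not Phoenix conventions:

```python
import os

# Hypothetical helper: derive a per-environment project name.
def project_for(env: str) -> str:
    return f"chatbot-{env}"  # e.g. chatbot-dev, chatbot-staging, chatbot-prod

# Route traces by deployment environment; APP_ENV is an assumed variable.
os.environ["PHOENIX_PROJECT_NAME"] = project_for(os.environ.get("APP_ENV", "dev"))
print(os.environ["PHOENIX_PROJECT_NAME"])
```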
## Common issues

**Traces not appearing:**

```python
from phoenix.otel import register

# Verify the endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces",  # Correct endpoint
)

# Force flush pending spans
from opentelemetry import trace
trace.get_tracer_provider().force_flush()
```

**High memory in a notebook:**

```python
# Close the session when done
session = px.launch_app()
# ... do work ...
session.close()
px.close_app()
```

**Database connection issues:**

```bash
# Verify the PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug
```
## References
- Advanced Usage - Custom evaluators, experiments, production setup
- Troubleshooting - Common issues, debugging, performance
## Resources
- Documentation: https://docs.arize.com/phoenix
- Repository: https://github.com/Arize-ai/phoenix
- Docker Hub: https://hub.docker.com/r/arizephoenix/phoenix
- Version: 12.0.0+
- License: Apache 2.0
## Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/17-observability/phoenix/SKILL.md

## Overview
Phoenix is an open-source AI observability and evaluation platform for LLM applications. It provides tracing, evaluation, datasets, experiments, and real-time monitoring in a self-hosted environment to debug, test, and monitor AI systems without vendor lock-in.
## How This Skill Works
Phoenix collects traces via OpenTelemetry for any LLM framework, supports LLM-as-judge evaluators, and stores data in PostgreSQL or SQLite. It enables versioned datasets, experiment pipelines, and an interactive Playground for comparing prompts, models, and configurations.
## When to Use It
- Debug LLM application issues with detailed traces
- Run systematic evaluations on datasets using LLM-as-judge evaluators
- Monitor production LLM systems in real-time with dashboards
- Build experiment pipelines to compare prompts, models, and configurations
- Self-host observability to avoid vendor lock-in
## Quick Start
- Step 1: Install Phoenix and any optional packages: `pip install arize-phoenix`, plus `arize-phoenix[embeddings]`, `arize-phoenix-otel`, `arize-phoenix-evals`, or `arize-phoenix-client` as needed
- Step 2: Launch the Phoenix server (e.g., `phoenix serve`) and open the UI (http://localhost:6006 by default)
- Step 3: Enable tracing by registering a tracer provider and instrumenting your OpenAI SDK (e.g., `phoenix.otel.register` and `OpenAIInstrumentor`), then start generating LLM calls for Phoenix to trace
## Best Practices
- Enable OpenTelemetry tracing across your LLM calls to capture end-to-end flows
- Use versioned datasets for regression testing and reproducible evaluations
- Leverage Phoenix experiments to compare prompts, models, and configurations
- Run the self-hosted setup with PostgreSQL or SQLite to avoid vendor lock-in
- Use the Playground for interactive prompt testing across multiple models
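Versioned datasets and automated evaluations in CI/CD often combine into a simple score gate: a new experiment run must not fall measurably below a stored baseline. The function and threshold below are illustrative, not a Phoenix feature:

```python
# Hypothetical CI regression gate comparing a candidate experiment's
# aggregate score against a stored baseline (tolerance is illustrative).
def passes_regression_gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    # Fail only if the candidate drops more than `tolerance` below baseline.
    return candidate >= baseline - tolerance

print(passes_regression_gate(0.91, 0.90))  # True: within tolerance
print(passes_regression_gate(0.91, 0.85))  # False: regression
```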
## Example Use Cases
- Debug a complex LLM integration by inspecting traces across prompt generation, retrieval, and response
- Run dataset-based evaluations to quantify quality with LLM-as-judge evaluators
- Monitor live production LLM services in real-time with traces and metrics
- Set up experiments to compare different prompts and model configurations
- Self-host a complete observability stack for privacy-sensitive AI workloads
## Related Skills
- **evaluating-code-models** (Orchestra-Research/AI-Research-SKILLs): Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project used by HuggingFace leaderboards.
- **langsmith-observability** (Orchestra-Research/AI-Research-SKILLs): LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
- **nemo-evaluator-sdk** (Orchestra-Research/AI-Research-SKILLs): Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
- **evaluating-llms-harness** (Orchestra-Research/AI-Research-SKILLs): Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, and APIs.