LangSmith - LLM Observability Platform

Development platform for debugging, evaluating, and monitoring language models and AI applications.

When to use LangSmith

Use LangSmith when:

  • Debugging LLM application issues (prompts, chains, agents)
  • Evaluating model outputs systematically against datasets
  • Monitoring production LLM systems
  • Building regression testing for AI features
  • Analyzing latency, token usage, and costs
  • Collaborating on prompt engineering

Key features:

  • Tracing: Capture inputs, outputs, latency for all LLM calls
  • Evaluation: Systematic testing with built-in and custom evaluators
  • Datasets: Create test sets from production traces or manually
  • Monitoring: Track metrics, errors, and costs in production
  • Integrations: Works with OpenAI, Anthropic, LangChain, LlamaIndex

Consider alternatives when your needs differ:

  • Weights & Biases: Deep learning experiment tracking, model training
  • MLflow: General ML lifecycle, model registry focus
  • Arize/WhyLabs: ML monitoring, data drift detection

Quick start

Installation

pip install langsmith

# Set environment variables
export LANGSMITH_API_KEY="your-api-key"
export LANGSMITH_TRACING=true

Basic tracing with @traceable

from langsmith import traceable
from openai import OpenAI

client = OpenAI()

@traceable
def generate_response(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

# Automatically traced to LangSmith
result = generate_response("What is machine learning?")

OpenAI wrapper (automatic tracing)

from langsmith.wrappers import wrap_openai
from openai import OpenAI

# Wrap client for automatic tracing
client = wrap_openai(OpenAI())

# All calls automatically traced
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Core concepts

Runs and traces

A run is a single execution unit (LLM call, chain, tool). Runs form hierarchical traces showing the full execution flow.

from langsmith import traceable

@traceable(run_type="chain")
def process_query(query: str) -> str:
    # Parent run
    context = retrieve_context(query)  # Child run
    response = generate_answer(query, context)  # Child run
    return response

@traceable(run_type="retriever")
def retrieve_context(query: str) -> list:
    # assumes a `vector_store` client is defined elsewhere
    return vector_store.search(query)

@traceable(run_type="llm")
def generate_answer(query: str, context: list) -> str:
    # assumes an `llm` client is defined elsewhere
    return llm.invoke(f"Context: {context}\n\nQuestion: {query}")

Projects

Projects organize related runs. Set via environment or code:

import os
os.environ["LANGSMITH_PROJECT"] = "my-project"

# Or per-function
@traceable(project_name="my-project")
def my_function():
    pass

Client API

from langsmith import Client

client = Client()

# List runs
runs = list(client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=100
))

# Get run details
run = client.read_run(run_id="...")

# Create feedback
client.create_feedback(
    run_id="...",
    key="correctness",
    score=0.9,
    comment="Good answer"
)

Datasets and evaluation

Create dataset

from langsmith import Client

client = Client()

# Create dataset
dataset = client.create_dataset("qa-test-set", description="QA evaluation")

# Add examples
client.create_examples(
    inputs=[
        {"question": "What is Python?"},
        {"question": "What is ML?"}
    ],
    outputs=[
        {"answer": "A programming language"},
        {"answer": "Machine learning"}
    ],
    dataset_id=dataset.id
)
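
Datasets can also be seeded from production traces using the same list_runs and create_examples calls shown above; a minimal sketch (the project name, filter, and dataset name are placeholders):

from langsmith import Client

client = Client()

# Select interesting production runs to turn into test examples
prod_runs = client.list_runs(
    project_name="my-project",
    filter='eq(status, "success")',
    limit=50
)

dataset = client.create_dataset("prod-derived-set", description="Seeded from production traces")
client.create_examples(
    inputs=[run.inputs for run in prod_runs],
    outputs=[run.outputs for run in prod_runs],
    dataset_id=dataset.id
)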

Run evaluation

from langsmith import evaluate

def my_model(inputs: dict) -> dict:
    # Your model logic
    return {"answer": generate_answer(inputs["question"])}

def correctness_evaluator(run, example):
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[correctness_evaluator],
    experiment_prefix="v1"
)

print(f"Average score: {results.aggregate_metrics['correctness']}")

Built-in evaluators

from langsmith.evaluation import LangChainStringEvaluator

# Use LangChain evaluators
results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        LangChainStringEvaluator("qa"),
        LangChainStringEvaluator("cot_qa")
    ]
)
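
Evaluator behavior can usually be tuned via the config argument, which is forwarded to the underlying LangChain evaluator; a hedged sketch (the criteria choice is illustrative):

results = evaluate(
    my_model,
    data="qa-test-set",
    evaluators=[
        # "labeled_criteria" grades outputs against a named criterion
        LangChainStringEvaluator("labeled_criteria", config={"criteria": "conciseness"})
    ]
)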

Advanced tracing

Tracing context

from langsmith import tracing_context

with tracing_context(
    project_name="experiment-1",
    tags=["production", "v2"],
    metadata={"version": "2.0"}
):
    # All traceable calls inherit context
    result = my_function()

Manual runs

from langsmith import trace

with trace(
    name="custom_operation",
    run_type="tool",
    inputs={"query": "test"}
) as run:
    result = do_something()
    run.end(outputs={"result": result})

Process inputs/outputs

def sanitize_inputs(inputs: dict) -> dict:
    # Return a copy so the caller's dict is not mutated
    cleaned = dict(inputs)
    if "password" in cleaned:
        cleaned["password"] = "***"
    return cleaned

@traceable(process_inputs=sanitize_inputs)
def login(username: str, password: str):
    return authenticate(username, password)
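
Outputs can be filtered the same way; a sketch assuming the process_outputs parameter available in recent SDK versions (traceable wraps a plain return value as {"output": value}):

def truncate_outputs(outputs: dict) -> dict:
    # Trim long responses so traces stay small
    text = outputs.get("output")
    if isinstance(text, str) and len(text) > 500:
        return {"output": text[:500] + "...[truncated]"}
    return outputs

@traceable(process_outputs=truncate_outputs)
def summarize(document: str) -> str:
    return generate_response(f"Summarize: {document}")  # traced helper from Quick start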

Sampling

import os
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"  # 10% sampling

LangChain integration

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Tracing enabled automatically with LANGSMITH_TRACING=true
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{input}")
])

chain = prompt | llm

# All chain runs traced automatically
response = chain.invoke({"input": "Hello!"})
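
Per-invocation tags and metadata can be attached through LangChain's standard config argument so traces are easy to filter in LangSmith (the tag and metadata values here are placeholders):

# Tags and metadata propagate to the LangSmith trace
response = chain.invoke(
    {"input": "Hello!"},
    config={"tags": ["experiment-a"], "metadata": {"user_id": "123"}}
)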

Production monitoring

Hub prompts

from langsmith import Client

client = Client()

# Pull prompt from hub
prompt = client.pull_prompt("my-org/qa-prompt")

# Use in application
result = prompt.invoke({"question": "What is AI?"})
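
Prompts can be versioned back to the hub as well; a minimal sketch assuming push_prompt is available in your SDK version (the prompt name is a placeholder):

from langchain_core.prompts import ChatPromptTemplate

new_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("user", "{question}")
])

# Uploads a new version of the prompt to the hub
client.push_prompt("my-org/qa-prompt", object=new_prompt)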

Async client

from langsmith import AsyncClient

async def main():
    client = AsyncClient()

    runs = []
    async for run in client.list_runs(project_name="my-project"):
        runs.append(run)

    return runs

Feedback collection

from langsmith import Client

client = Client()

# Collect user feedback
def record_feedback(run_id: str, user_rating: int, comment: str | None = None):
    client.create_feedback(
        run_id=run_id,
        key="user_rating",
        score=user_rating / 5.0,  # Normalize to 0-1
        comment=comment
    )

# In your application
record_feedback(run_id="...", user_rating=4, comment="Helpful response")
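
The run_id placeholder above has to come from a real trace; one way is to capture it inside a traced function, a sketch assuming get_current_run_tree from the SDK:

from langsmith import traceable, get_current_run_tree

@traceable
def answer(question: str) -> tuple[str, str]:
    run = get_current_run_tree()  # the run created for this call
    return generate_response(question), str(run.id)

reply, run_id = answer("What is AI?")
record_feedback(run_id=run_id, user_rating=5, comment="Looks right")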

Testing integration

Pytest integration

from langsmith import test

@test
def test_qa_accuracy():
    result = my_qa_function("What is Python?")
    assert "programming" in result.lower()

Evaluation in CI/CD

from langsmith import evaluate

def run_evaluation():
    results = evaluate(
        my_model,
        data="regression-test-set",
        evaluators=[accuracy_evaluator]
    )

    # Fail CI if accuracy drops (to_pandas aggregates per-run feedback scores)
    accuracy = results.to_pandas()["feedback.accuracy"].mean()
    assert accuracy >= 0.9, \
        f"Accuracy {accuracy} below threshold"

Best practices

  1. Structured naming - Use consistent project/run naming conventions (see the sketch after this list)
  2. Add metadata - Include version, environment, user info
  3. Sample in production - Use sampling rate to control volume
  4. Create datasets - Build test sets from interesting production cases
  5. Automate evaluation - Run evaluations in CI/CD pipelines
  6. Monitor costs - Track token usage and latency trends
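
Practices 1-3 combine naturally in code; a hedged sketch (the project name, metadata values, and sampling rate are placeholders):

import os
from langsmith import traceable

os.environ["LANGSMITH_PROJECT"] = "chatbot-prod"        # consistent project naming
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.25"  # sample 25% of production traffic

@traceable(
    name="answer_user",                          # consistent run naming
    tags=["prod", "v2"],
    metadata={"version": "2.0", "env": "prod"},  # version/environment info
)
def answer_user(question: str) -> str:
    return generate_response(question)  # traced helper from Quick start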

Common issues

Traces not appearing:

import os
# Ensure tracing is enabled
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "your-key"

# Verify connection
from langsmith import Client
client = Client()
print(list(client.list_projects()))  # should list your projects

High latency from tracing:

# Enable background batching (default)
from langsmith import Client
client = Client(auto_batch_tracing=True)

# Or use sampling
os.environ["LANGSMITH_TRACING_SAMPLING_RATE"] = "0.1"

Large payloads:

# Hide sensitive/large fields
@traceable(
    process_inputs=lambda x: {k: v for k, v in x.items() if k != "large_field"}
)
def my_function(data):
    pass

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/17-observability/langsmith/SKILL.md

Overview

LangSmith is an LLM observability platform for tracing, evaluation, and monitoring. It helps you debug prompts and chains, evaluate outputs against datasets, monitor production LLM systems, and build regression tests to ensure AI features stay reliable.

How This Skill Works

LangSmith captures inputs, outputs, and latency for every LLM call using tracing and runs. It provides built-in and custom evaluators to run systematic tests, and it lets you build datasets from production traces for evaluation and regression testing. It integrates with OpenAI, Anthropic, LangChain, and LlamaIndex to simplify workflows and monitoring.

When to Use It

  • Debug LLM application issues (prompts, chains, agents).
  • Evaluate model outputs against curated datasets.
  • Monitor production LLM systems for performance and costs.
  • Build regression tests for AI features.
  • Analyze latency, token usage, and cost trends.

Quick Start

  1. Install langsmith and set your API key (export LANGSMITH_API_KEY).
  2. Enable tracing with @traceable or by wrapping the OpenAI client (wrap_openai).
  3. Run your app and view runs and traces in the LangSmith dashboard.

Best Practices

  • Enable tracing on all LLM calls to capture inputs, outputs, and latency.
  • Create test datasets from production traces for robust evaluation.
  • Leverage built-in evaluators and add domain-specific evaluators as needed.
  • Monitor production metrics, errors, and costs to detect regressions.
  • Integrate LangSmith with your prompts, chains, and client wrappers (e.g., wrap_openai).

Example Use Cases

  • Debug a failing chat flow by tracing a prompt through a chain and measuring latency.
  • Evaluate a QA system against a test set and compare results in a dashboard.
  • Monitor a production assistant for spikes in errors and costs.
  • Run regression tests when releasing a new AI feature with datasets and evaluators.
  • Compare different prompts via traces to identify the best strategy.

Related Skills

phoenix-observability

Orchestra-Research/AI-Research-SKILLs

Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.

evaluating-code-models

Orchestra-Research/AI-Research-SKILLs

Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.

crewai-multi-agent

Orchestra-Research/AI-Research-SKILLs

Multi-agent orchestration framework for autonomous AI collaboration. Use when building teams of specialized agents working together on complex tasks, when you need role-based agent collaboration with memory, or for production workflows requiring sequential/hierarchical execution. Built without LangChain dependencies for lean, fast execution.

langchain

Orchestra-Research/AI-Research-SKILLs

Framework for building LLM-powered applications with agents, chains, and RAG. Supports multiple providers (OpenAI, Anthropic, Google), 500+ integrations, ReAct agents, tool calling, memory management, and vector store retrieval. Use for building chatbots, question-answering systems, autonomous agents, or RAG applications. Best for rapid prototyping and production deployments.

nemo-evaluator-sdk

Orchestra-Research/AI-Research-SKILLs

Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.

nemo-guardrails

Orchestra-Research/AI-Research-SKILLs

NVIDIA's runtime safety framework for LLM applications. Features jailbreak detection, input/output validation, fact-checking, hallucination detection, PII filtering, toxicity detection. Uses Colang 2.0 DSL for programmable rails. Production-ready, runs on T4 GPU.
