What are the evaluation layers and why are they important?

Unit evals verify prompt-level correctness, tool evals check API/tool call quality, end-to-end evals test realistic multi-step tasks, and safety evals assess prompt injection and data leakage resistance to keep agents safe.

How do I integrate agent-evals into CI/CD?

Use the provided commands (e.g., make evals-smoke, make evals-regression, make evals-safety) to run evaluations in your pipeline and gate deployments when regressions are detected.

How should I handle score drift and baselines?

Version datasets and expected outputs, monitor drift over time, and re-baseline when the data or task changes to keep evaluations meaningful and fair.

agent-evals

npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/agent-evals --openclaw

Files (1)

SKILL.md

966 B

Agent Evals

Create repeatable checks so agent behavior improves safely over time.

Evaluation Layers

Unit evals: prompt-level correctness
Tool evals: API/tool call decision quality
End-to-end evals: realistic multi-step tasks
Safety evals: prompt injection and data leak resistance

CI/CD Integration

# Example eval pipeline steps
make evals-smoke
make evals-regression
make evals-safety

Best Practices

Version datasets with expected outputs.
Track pass rates and score drift over time.
Block deploys on critical safety regressions.

Related Skills

github-actions - Eval automation in CI
ai-agent-security - Security-focused eval cases

Source

git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills/blob/main/devops/ai/agent-evals/SKILL.mdView on GitHub

Overview

Agent Evals creates repeatable checks to improve agent behavior safely over time using golden datasets, rubrics, and regression gates. It defines evaluation layers (unit, tool, end-to-end, safety) and supports CI/CD integration to halt deployments when regressions are found.

How This Skill Works

You define evaluation suites anchored by golden datasets and rubrics for each layer (unit, tool, end-to-end, safety). The pipeline runs these evaluations, tracks pass rates and drift, and surfaces failures. Integrate the checks into CI/CD (e.g., via make evals-smoke, make evals-regression, make evals-safety) so critical regressions trigger deployment gates.

When to Use It

Before releasing a new AI agent version to verify prompt-level correctness (unit evals).
When validating API/tool call decision quality against defined rubrics (tool evals).
To test realistic multi-step tasks that end-to-end evals simulate.
To assess resilience to prompt injection and data leaks (safety evals).
As part of CI/CD to gate deployments based on evaluation results.

Quick Start

Step 1: Create golden datasets and rubrics for each evaluation layer (unit, tool, end-to-end, safety).
Step 2: Wire the evaluation pipeline into CI/CD using commands like make evals-smoke, make evals-regression, make evals-safety.
Step 3: Monitor results, enforce regression gates, and re-baseline as data and models evolve.

Best Practices

Version datasets with expected outputs to keep baselines stable.
Track pass rates and score drift over time to detect quality changes.
Block deploys on critical safety regressions to protect users.
Maintain clear rubrics and keep evaluation scripts under version control.
Separate evaluation data from production data and regularly refresh datasets.

Example Use Cases

Cement a golden dataset for a QA agent and run unit evals to verify prompt accuracy.
Define rubrics for tool calls and automate tool evals to ensure consistent API decision quality.
Run end-to-end evals that simulate user workflows to catch regressions in multi-step tasks.
Execute safety evals to detect prompt injection or data leakage vulnerabilities.
Integrate eval pipelines with GitHub Actions to gate deployments on evaluation outcomes.

Frequently Asked Questions

Add this skill to your agents