agent-evals
npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/agent-evals --openclawFiles (1)
SKILL.md
966 B
Agent Evals
Create repeatable checks so agent behavior improves safely over time.
Evaluation Layers
- Unit evals: prompt-level correctness
- Tool evals: API/tool call decision quality
- End-to-end evals: realistic multi-step tasks
- Safety evals: prompt injection and data leak resistance
CI/CD Integration
# Example eval pipeline steps
make evals-smoke
make evals-regression
make evals-safety
Best Practices
- Version datasets with expected outputs.
- Track pass rates and score drift over time.
- Block deploys on critical safety regressions.
Related Skills
- github-actions - Eval automation in CI
- ai-agent-security - Security-focused eval cases
Source
git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills/blob/main/devops/ai/agent-evals/SKILL.mdView on GitHub Overview
Agent Evals creates repeatable checks to improve agent behavior safely over time using golden datasets, rubrics, and regression gates. It defines evaluation layers (unit, tool, end-to-end, safety) and supports CI/CD integration to halt deployments when regressions are found.
How This Skill Works
You define evaluation suites anchored by golden datasets and rubrics for each layer (unit, tool, end-to-end, safety). The pipeline runs these evaluations, tracks pass rates and drift, and surfaces failures. Integrate the checks into CI/CD (e.g., via make evals-smoke, make evals-regression, make evals-safety) so critical regressions trigger deployment gates.
When to Use It
- Before releasing a new AI agent version to verify prompt-level correctness (unit evals).
- When validating API/tool call decision quality against defined rubrics (tool evals).
- To test realistic multi-step tasks that end-to-end evals simulate.
- To assess resilience to prompt injection and data leaks (safety evals).
- As part of CI/CD to gate deployments based on evaluation results.
Quick Start
- Step 1: Create golden datasets and rubrics for each evaluation layer (unit, tool, end-to-end, safety).
- Step 2: Wire the evaluation pipeline into CI/CD using commands like make evals-smoke, make evals-regression, make evals-safety.
- Step 3: Monitor results, enforce regression gates, and re-baseline as data and models evolve.
Best Practices
- Version datasets with expected outputs to keep baselines stable.
- Track pass rates and score drift over time to detect quality changes.
- Block deploys on critical safety regressions to protect users.
- Maintain clear rubrics and keep evaluation scripts under version control.
- Separate evaluation data from production data and regularly refresh datasets.
Example Use Cases
- Cement a golden dataset for a QA agent and run unit evals to verify prompt accuracy.
- Define rubrics for tool calls and automate tool evals to ensure consistent API decision quality.
- Run end-to-end evals that simulate user workflows to catch regressions in multi-step tasks.
- Execute safety evals to detect prompt injection or data leakage vulnerabilities.
- Integrate eval pipelines with GitHub Actions to gate deployments on evaluation outcomes.
Frequently Asked Questions
Add this skill to your agents