Get the FREE Ultimate OpenClaw Setup Guide →

agent-evals

npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/agent-evals --openclaw
Files (1)
SKILL.md
966 B

Agent Evals

Create repeatable checks so agent behavior improves safely over time.

Evaluation Layers

  • Unit evals: prompt-level correctness
  • Tool evals: API/tool call decision quality
  • End-to-end evals: realistic multi-step tasks
  • Safety evals: prompt injection and data leak resistance

CI/CD Integration

# Example eval pipeline steps
make evals-smoke
make evals-regression
make evals-safety

Best Practices

  • Version datasets with expected outputs.
  • Track pass rates and score drift over time.
  • Block deploys on critical safety regressions.

Related Skills

Source

git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills/blob/main/devops/ai/agent-evals/SKILL.mdView on GitHub

Overview

Agent Evals creates repeatable checks to improve agent behavior safely over time using golden datasets, rubrics, and regression gates. It defines evaluation layers (unit, tool, end-to-end, safety) and supports CI/CD integration to halt deployments when regressions are found.

How This Skill Works

You define evaluation suites anchored by golden datasets and rubrics for each layer (unit, tool, end-to-end, safety). The pipeline runs these evaluations, tracks pass rates and drift, and surfaces failures. Integrate the checks into CI/CD (e.g., via make evals-smoke, make evals-regression, make evals-safety) so critical regressions trigger deployment gates.

When to Use It

  • Before releasing a new AI agent version to verify prompt-level correctness (unit evals).
  • When validating API/tool call decision quality against defined rubrics (tool evals).
  • To test realistic multi-step tasks that end-to-end evals simulate.
  • To assess resilience to prompt injection and data leaks (safety evals).
  • As part of CI/CD to gate deployments based on evaluation results.

Quick Start

  1. Step 1: Create golden datasets and rubrics for each evaluation layer (unit, tool, end-to-end, safety).
  2. Step 2: Wire the evaluation pipeline into CI/CD using commands like make evals-smoke, make evals-regression, make evals-safety.
  3. Step 3: Monitor results, enforce regression gates, and re-baseline as data and models evolve.

Best Practices

  • Version datasets with expected outputs to keep baselines stable.
  • Track pass rates and score drift over time to detect quality changes.
  • Block deploys on critical safety regressions to protect users.
  • Maintain clear rubrics and keep evaluation scripts under version control.
  • Separate evaluation data from production data and regularly refresh datasets.

Example Use Cases

  • Cement a golden dataset for a QA agent and run unit evals to verify prompt accuracy.
  • Define rubrics for tool calls and automate tool evals to ensure consistent API decision quality.
  • Run end-to-end evals that simulate user workflows to catch regressions in multi-step tasks.
  • Execute safety evals to detect prompt injection or data leakage vulnerabilities.
  • Integrate eval pipelines with GitHub Actions to gate deployments on evaluation outcomes.

Frequently Asked Questions

Add this skill to your agents
Sponsor this space

Reach thousands of developers