ai-evals

npx machina-cli add skill liqiongyu/lenny_skills_plus/ai-evals --openclaw
Files (1)
SKILL.md
7.5 KB

AI Evals

Scope

Covers

  • Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
  • Converting failures into a golden test set + error taxonomy + rubric
  • Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
  • Producing decision-ready results and an iteration loop (every bug becomes a new test)

When to use

  • “Design evals for this LLM feature so we can ship with confidence.”
  • “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
  • “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
  • “Compare prompts/models safely with a clear acceptance threshold.”

When NOT to use

  • You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
  • You’re primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
  • You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
  • You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).

Inputs

Minimum required

  • System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
  • The decision the eval must support (ship/no-ship, compare options, regression gate)
  • What “good” means: 3–10 target behaviors + top failure modes
  • Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline

Missing-info strategy

  • Ask the highest-value questions from references/INTAKE.md (3–5 at a time, five total at most).
  • If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
  • If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).

Outputs (deliverables)

Produce an AI Evals Pack (in chat; or as files if requested), in this order:

  1. Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
  2. Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
  3. Error taxonomy (from error analysis + open coding): failure modes, severity, examples
  4. Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
  5. Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
  6. Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
  7. Risks / Open questions / Next steps (always included)

Templates: references/TEMPLATES.md

Workflow (7 steps)

1) Define the decision and write the Eval PRD

  • Inputs: SUT description, stakeholders, decision to support.
  • Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
  • Outputs: Draft Eval PRD (template in references/TEMPLATES.md).
  • Checks: A stakeholder can restate what is being measured, why, and what “pass” means.

2) Draft the golden set structure + coverage plan

  • Inputs: User workflows, edge cases, safety risks, data availability.
  • Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).
  • Outputs: Test set spec + initial golden set.
  • Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
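
The schema and coverage check above can be sketched in code. This is a minimal illustration, not a prescribed format: the field names (`case_id`, `tags`, `target_behavior`) and the example cases are assumptions for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One golden-set entry; field names are illustrative, not mandated by the skill."""
    case_id: str
    input: str                    # prompt or workflow input fed to the SUT
    expected: str                 # reference answer or behavioral description
    tags: list = field(default_factory=list)   # e.g. ["happy-path"], ["adversarial", "high-severity"]
    target_behavior: str = ""     # which Eval PRD target behavior this case exercises

def coverage_by_behavior(cases):
    """Count cases per target behavior, to enforce the 'at least 2 per behavior' check."""
    counts = {}
    for c in cases:
        counts[c.target_behavior] = counts.get(c.target_behavior, 0) + 1
    return counts

cases = [
    EvalCase("c1", "Refund request", "Polite reply citing refund policy",
             ["happy-path"], "policy-accuracy"),
    EvalCase("c2", "Refund request with profanity", "De-escalate, still answer",
             ["tricky"], "policy-accuracy"),
]
```

A coverage gap then shows up as any behavior whose count is below 2.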

3) Run error analysis and open coding to build a taxonomy

  • Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
  • Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).
  • Outputs: Error taxonomy + “top failure modes” list.
  • Checks: Taxonomy is mutually understandable by PM/eng; each category has 1–2 concrete examples.
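
A consolidated taxonomy can be kept as plain data so PM and eng share one artifact. The categories, severities, and examples below are hypothetical placeholders, assuming a three-level severity scale.

```python
# Illustrative taxonomy: category -> (severity, concrete example). All entries are assumptions.
TAXONOMY = {
    "hallucinated-citation": ("high", "Cites KB-123, which does not exist"),
    "format-violation": ("medium", "Returned markdown instead of JSON"),
    "tone-mismatch": ("low", "Overly casual reply to an escalation"),
}

def top_failure_modes(taxonomy, min_severity="high"):
    """Return categories at or above a severity floor, sorted for stable reporting."""
    order = {"low": 0, "medium": 1, "high": 2}
    return sorted(
        cat for cat, (sev, _example) in taxonomy.items()
        if order[sev] >= order[min_severity]
    )
```

Keeping one example string per category satisfies the “1–2 concrete examples” check mechanically.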

4) Convert taxonomy → rubric + scoring rules

  • Inputs: Taxonomy, target behaviors, output formats.
  • Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons.
  • Outputs: Rubric + judging guide.
  • Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
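
One way to make “not vibes” concrete is to encode the dimensions, weights, and tie-breaker as data. The dimensions, 1–5 scale, weights, and safety-first tie-breaker below are assumptions for illustration, not the skill’s required rubric.

```python
# Hypothetical rubric: weighted dimensions on a 1-5 scale.
RUBRIC = {
    "correctness": {"weight": 0.5},
    "safety": {"weight": 0.3},
    "format": {"weight": 0.2},
}

def overall(scores, rubric=RUBRIC):
    """Weighted average; `scores` maps each dimension to a 1-5 score."""
    return sum(rubric[d]["weight"] * scores[d] for d in rubric)

def pick_winner(a_scores, b_scores):
    """Pairwise comparison with a deterministic tie-breaker: safety decides ties."""
    a, b = overall(a_scores), overall(b_scores)
    if a != b:
        return "A" if a > b else "B"
    return "A" if a_scores["safety"] >= b_scores["safety"] else "B"
```

The same structure supports either absolute scoring (report `overall`) or pairwise comparison (report `pick_winner`), the choice this step calls out.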

5) Choose the judging approach + harness/runbook

  • Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
  • Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost.
  • Outputs: Judge + harness plan.
  • Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
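
Repeatability can be sketched as a harness that hashes its versioned config into a run ID and records every case. `system_under_test` and `judge` are caller-supplied callables (a human form, an LLM-as-judge call, or an automated check); their internals are assumptions, not part of this skill.

```python
import hashlib
import json

def run_eval(cases, system_under_test, judge, config):
    """Minimal repeatable-harness sketch: same config -> same run_id, one record per case."""
    run_id = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]
    results = []
    for case in cases:
        output = system_under_test(case["input"])
        results.append({
            "run_id": run_id,
            "case_id": case["case_id"],
            "output": output,
            "score": judge(case, output),
        })
    return results

# Cheapest automated-check judge: deterministic exact match against the golden answer.
def exact_match(case, output):
    return 1.0 if output == case["expected"] else 0.0
```

Because the config is serialized with sorted keys, two runs with the same prompts/model settings share a `run_id`, which makes stored results comparable across time.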

6) Define reporting, thresholds, and the iteration loop

  • Inputs: Stakeholder needs, release cadence.
  • Actions: Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.
  • Outputs: Reporting + iteration loop.
  • Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
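
The “overall + per-tag metrics” report and the regression rule can be sketched as follows; the record shapes and the 0.02 tolerance are assumed policy knobs, not values the skill prescribes.

```python
def per_tag_metrics(results, cases):
    """Mean score overall and per tag; `results` pair a case_id with a numeric score."""
    tags_of = {c["case_id"]: c["tags"] for c in cases}
    totals = {"overall": []}
    for r in results:
        totals["overall"].append(r["score"])
        for tag in tags_of[r["case_id"]]:
            totals.setdefault(tag, []).append(r["score"])
    return {k: sum(v) / len(v) for k, v in totals.items()}

def regressed_tags(baseline, candidate, tolerance=0.02):
    """Return every metric that dropped more than `tolerance` below baseline."""
    return [k for k in baseline if candidate.get(k, 0.0) < baseline[k] - tolerance]
```

A non-empty `regressed_tags` result is the detectable-regression signal: it names exactly which scenario tags fell, so the report answers “what broke” without a meeting.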

7) Quality gate + finalize

  • Inputs: Full draft pack.
  • Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
  • Outputs: Final AI Evals Pack.
  • Checks: The eval definition functions as a product requirement: clear, testable, and actionable.

Quality gate (required)

Before delivering the pack, run references/CHECKLISTS.md and score against references/RUBRIC.md (step 7); the pack ships only after both pass and Risks / Open questions / Next steps are included.

Examples

Example 1 (answer quality + safety): “Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack.”

Example 2 (structured extraction): “Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack.”
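
For Example 2, the “must always return valid JSON” constraint is exactly the kind of automated check the skill recommends. The field names `amount` and `due_date` come from the example’s constraints; everything else here is illustrative.

```python
import json

REQUIRED = ("amount", "due_date")  # fields the example prioritizes for recall

def check_invoice_output(raw):
    """Automated check: valid JSON, and required fields present and non-empty."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False, "missing": list(REQUIRED)}
    missing = [f for f in REQUIRED if f not in data or data[f] in (None, "")]
    return {"valid_json": True, "missing": missing}
```

Checks like this run on every case for near-zero cost, leaving the human or LLM judge to score only what automation cannot (e.g., whether the extracted values are actually correct).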

Boundary example: “We don’t know what the AI feature should do yet—just ‘add AI’ and pick a model.”
Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.

Source

git clone https://github.com/liqiongyu/lenny_skills_plus.git
Skill file: skills/ai-evals/SKILL.md

Overview

AI Evals enables you to design evaluation contracts for LLM features, turning failures into structured artifacts such as golden test sets, error taxonomies, and rubrics. It also defines judging approaches and a repeatable iteration loop to produce decision-ready results for ship/no-ship gates.

How This Skill Works

Start by defining the decision and drafting the Eval PRD, then build a golden test set and an error taxonomy. Next, create a rubric and a judge plan, then a harness and runbook to execute evaluations; the results feed an iteration loop that turns bugs into new tests.

When to Use It

  • Design evals for this LLM feature so we can ship with confidence.
  • Create a rubric + golden set + benchmark for our AI assistant/copilot.
  • We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.
  • Compare prompts/models safely with a clear acceptance threshold.
  • Turn bugs into new tests to support regression safety.

Quick Start

  1. Step 1: Define the decision and draft the Eval PRD.
  2. Step 2: Draft the golden set structure, test case schema, and coverage plan.
  3. Step 3: Create the error taxonomy, rubric, judge plan, harness, and start the iteration loop.

Best Practices

  • Define the decision and acceptance thresholds in the Eval PRD.
  • Specify test case schema, coverage targets, and a starter set.
  • Create an explicit error taxonomy with severity and examples.
  • Document a clear judge plan and calibration steps (human, LLM, automated checks).
  • Treat every bug as a trigger to generate a new test in the iteration loop.

Example Use Cases

  • Eval-PRD-driven rollout for an AI assistant feature with clear acceptance thresholds.
  • Golden set and rubric to benchmark an AI copilot's responses.
  • Error taxonomy used to diagnose flaky QA in a chatbot.
  • Judge plan with human-in-the-loop for model ranking and tie-breakers.
  • Iteration loop converting reported bugs into new tests and regression gates.
