subagent-testing
npx machina-cli add skill athola/claude-night-market/subagent-testing --openclaw
Subagent Testing - TDD for Skills
Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.
Table of Contents
- Overview
- Why Fresh Instances Matter
- Testing Methodology
- Quick Start
- Detailed Testing Guide
- Success Criteria
Overview
Fresh instances prevent priming: each test runs in a new Claude conversation, so the skill's impact is measured in isolation rather than confounded by conversation-history effects.
Why Fresh Instances Matter
The Priming Problem
Running tests in the same conversation creates bias:
- Prior context influences responses
- Skill effects get mixed with conversation history
- Can't isolate skill's true impact
Fresh Instance Benefits
- Isolation: Each test starts clean
- Reproducibility: Consistent baseline state
- Measurement: Clear before/after comparison
- Validation: Proves skill effectiveness, not priming
Testing Methodology
Three-phase TDD-style approach:
Phase 1: Baseline Testing (RED)
Test without skill to establish baseline behavior.
Phase 2: With-Skill Testing (GREEN)
Test with skill loaded to measure improvements.
Phase 3: Rationalization Testing (REFACTOR)
Test skill's anti-rationalization guardrails.
Quick Start
# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses
# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline
# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work
Detailed Testing Guide
For complete testing patterns, examples, and templates:
- Testing Patterns - Full TDD methodology
- Test Examples - Baseline, with-skill, rationalization tests
- Analysis Templates - Scoring and comparison frameworks
Success Criteria
- Baseline: Document 5+ diverse baseline scenarios
- Improvement: ≥50% improvement in skill-related metrics
- Consistency: Results reproducible across fresh instances
- Rationalization Defense: Guardrails prevent ≥80% of rationalization attempts
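The numeric thresholds above can be checked mechanically. The scoring scale itself is up to you; the helpers below only assume you can produce a baseline score, a with-skill score, and counts of blocked versus attempted rationalizations.

```python
def improvement(baseline_score: float, skill_score: float) -> float:
    """Relative improvement over baseline (0.50 means +50%)."""
    if baseline_score == 0:
        return float("inf") if skill_score > 0 else 0.0
    return (skill_score - baseline_score) / baseline_score

def meets_criteria(baseline_score: float, skill_score: float,
                   blocked: int, attempts: int) -> bool:
    # Success criteria from this skill: >=50% improvement, >=80% defense.
    defense_rate = blocked / attempts if attempts else 0.0
    return improvement(baseline_score, skill_score) >= 0.50 and defense_rate >= 0.80

print(meets_criteria(4.0, 6.5, 9, 10))  # 62.5% improvement, 90% defense -> True
```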
See Also
- skill-authoring: Creating effective skills
- bulletproof-skill: Anti-rationalization patterns
- test-skill: Automated skill testing command
Source
https://github.com/athola/claude-night-market/blob/master/plugins/abstract/skills/subagent-testing/SKILL.md
Overview
Subagent Testing applies a three-phase TDD-style approach using fresh subagent instances to isolate skill impact from conversation history. It helps validate skill improvements, prevent priming bias, and quantify effectiveness across isolated conversations.
How This Skill Works
Tests run in isolated Claude conversations: first establish a baseline without the skill, then test with the skill loaded, and finally run rationalization tests that exercise the skill's anti-rationalization guardrails. Identical prompts and fresh instances give reproducible measurements and clear before/after comparisons, judged against defined success criteria.
When to Use It
- Validating skill improvements after updates
- Measuring a skill's impact on behavior in isolation
- Preventing priming bias in skill validation
- Reproducing results across fresh conversations
- Testing anti-rationalization guardrails and their consistency across instances
Quick Start
- Step 1: Create baseline tests (without skill) using 5 diverse scenarios
- Step 2: Create with-skill tests (fresh instances); load skill explicitly; use identical prompts; compare to baseline
- Step 3: Create rationalization tests; verify anti-rationalization guardrails work
Best Practices
- Use fresh subagent instances for every test to avoid context bleed.
- Document 5+ diverse baseline scenarios before loading the skill.
- Load the skill explicitly during with-skill tests and keep prompts identical.
- Record and compare full responses to baseline to quantify changes.
- Define explicit success criteria (e.g., ≥50% improvement, ≥80% defense).
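The practice of recording full responses and comparing them against the baseline can be made concrete with a simple rubric. The marker strings below are invented for illustration; a real rubric would list behaviors the skill is supposed to induce.

```python
# Hypothetical rubric: substrings whose presence indicates a desired
# skill-induced behavior in a response.
RUBRIC = {
    "cites_guardrail": "per the skill",
    "asks_clarifying": "?",
    "refuses_shortcut": "cannot skip",
}

def score(response: str) -> int:
    """Count how many rubric markers a response exhibits."""
    return sum(marker in response for marker in RUBRIC.values())

def compare(baseline: dict[str, str], with_skill: dict[str, str]) -> dict[str, int]:
    # Positive delta: the with-skill response hit more rubric markers.
    return {k: score(with_skill[k]) - score(baseline[k]) for k in baseline}
```

Per-scenario deltas quantify the change, which feeds directly into the ≥50% improvement criterion above.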
Example Use Cases
- Baseline vs with-skill comparison for an updated retrieval or reasoning skill.
- Assessing a sentiment or policy-compliance skill in new conversations.
- Guardrail testing to ensure anti-rationalization patterns are triggered reliably.
- Reproducibility check across 3 fresh subagent instances for the same scenario.
- Cross-scenario consistency of skill effects after a bug fix.
Related Skills
precommit-setup
athola/claude-night-market
Configure three-layer pre-commit system with linting, type checking, and testing hooks. Use for quality gate setup and code standards. Skip if pre-commit is optimally configured.
ab-test-setup
coreyhaines31/marketingskills
When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," or "how long should I run this test." Use this whenever someone is comparing two approaches and wants to measure which performs better. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
Playwright Browser Automation
jpulido240-svg/playwright-skill
Complete browser automation with Playwright. Auto-detects dev servers, writes clean test scripts to /tmp. Test pages, fill forms, take screenshots, check responsive design, validate UX, test login flows, check links, automate any browser task. Use when user wants to test websites, automate browser interactions, validate web functionality, or perform any browser-based testing.
python-testing
athola/claude-night-market
Consult this skill for Python testing implementation and patterns. Use
workflow-setup
athola/claude-night-market
Configure GitHub Actions CI/CD workflows for automated testing, linting, and deployment. Use for CI/CD setup and quality automation. Skip if CI/CD configured or using different platform.
RubyCritic Code Quality Analysis
esparkman/claude-rubycritic-skill
Analyze Ruby and Rails code quality with RubyCritic. Identifies code smells, complexity issues, and refactoring opportunities. Provides detailed metrics, scores files A-F, compares branches, and prioritizes high-churn problem areas. Use when analyzing Ruby code quality, reviewing PRs, or identifying technical debt.