subagent-testing
npx machina-cli add skill athola/claude-night-market/subagent-testing --openclaw
Subagent Testing - TDD for Skills
Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.
Table of Contents
- Overview
- Why Fresh Instances Matter
- Testing Methodology
- Quick Start
- Detailed Testing Guide
- Success Criteria
Overview
Fresh instances prevent priming: each test runs in a new Claude conversation, so the skill's impact is measured in isolation rather than confounded by conversation-history effects.
Why Fresh Instances Matter
The Priming Problem
Running tests in the same conversation creates bias:
- Prior context influences responses
- Skill effects get mixed with conversation history
- Can't isolate skill's true impact
Fresh Instance Benefits
- Isolation: Each test starts clean
- Reproducibility: Consistent baseline state
- Measurement: Clear before/after comparison
- Validation: Proves skill effectiveness, not priming
Testing Methodology
Three-phase TDD-style approach:
Phase 1: Baseline Testing (RED)
Test without skill to establish baseline behavior.
Phase 2: With-Skill Testing (GREEN)
Test with skill loaded to measure improvements.
Phase 3: Rationalization Testing (REFACTOR)
Test skill's anti-rationalization guardrails.
Quick Start
# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses
# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline
# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work
Detailed Testing Guide
For complete testing patterns, examples, and templates:
- Testing Patterns - Full TDD methodology
- Test Examples - Baseline, with-skill, rationalization tests
- Analysis Templates - Scoring and comparison frameworks
Success Criteria
- Baseline: Document 5+ diverse baseline scenarios
- Improvement: ≥50% improvement in skill-related metrics
- Consistency: Results reproducible across fresh instances
- Rationalization Defense: Guardrails prevent ≥80% of rationalization attempts
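The numeric thresholds above can be checked mechanically. The scoring scale itself is up to you; the helpers below only assume you can produce a baseline score, a with-skill score, and counts of blocked versus attempted rationalizations.

```python
def improvement(baseline_score: float, skill_score: float) -> float:
    """Relative improvement over baseline (0.50 means +50%)."""
    if baseline_score == 0:
        return float("inf") if skill_score > 0 else 0.0
    return (skill_score - baseline_score) / baseline_score

def meets_criteria(baseline_score: float, skill_score: float,
                   blocked: int, attempts: int) -> bool:
    # Success criteria from this skill: >=50% improvement, >=80% defense.
    defense_rate = blocked / attempts if attempts else 0.0
    return improvement(baseline_score, skill_score) >= 0.50 and defense_rate >= 0.80

print(meets_criteria(4.0, 6.5, 9, 10))  # 62.5% improvement, 90% defense -> True
```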
See Also
- skill-authoring: Creating effective skills
- bulletproof-skill: Anti-rationalization patterns
- test-skill: Automated skill testing command
Source
https://github.com/athola/claude-night-market/blob/master/plugins/abstract/skills/subagent-testing/SKILL.md
Overview
Subagent Testing applies a three-phase TDD-style approach using fresh subagent instances to isolate skill impact from conversation history. It helps validate skill improvements, prevent priming bias, and quantify effectiveness across isolated conversations.
How This Skill Works
Tests run in isolated Claude conversations: first establish a baseline without the skill, then test with the skill loaded, and finally run rationalization tests that exercise the skill's anti-rationalization guardrails. Identical prompts and fresh instances give reproducible measurements and clear before/after comparisons, judged against defined success criteria.
When to Use It
- Validating skill improvements after updates
- Measuring a skill's impact on behavior in isolation
- Preventing priming bias in skill validation
- Reproducing results across fresh conversations
- Testing anti-rationalization guardrails and their consistency across instances
Quick Start
- Step 1: Create baseline tests (without skill) using 5 diverse scenarios
- Step 2: Create with-skill tests (fresh instances); load skill explicitly; use identical prompts; compare to baseline
- Step 3: Create rationalization tests; verify anti-rationalization guardrails work
Best Practices
- Use fresh subagent instances for every test to avoid context bleed.
- Document 5+ diverse baseline scenarios before loading the skill.
- Load the skill explicitly during with-skill tests and keep prompts identical.
- Record and compare full responses to baseline to quantify changes.
- Define explicit success criteria (e.g., ≥50% improvement, ≥80% defense).
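The practice of recording full responses and comparing them against the baseline can be made concrete with a simple rubric. The marker strings below are invented for illustration; a real rubric would list behaviors the skill is supposed to induce.

```python
# Hypothetical rubric: substrings whose presence indicates a desired
# skill-induced behavior in a response.
RUBRIC = {
    "cites_guardrail": "per the skill",
    "asks_clarifying": "?",
    "refuses_shortcut": "cannot skip",
}

def score(response: str) -> int:
    """Count how many rubric markers a response exhibits."""
    return sum(marker in response for marker in RUBRIC.values())

def compare(baseline: dict[str, str], with_skill: dict[str, str]) -> dict[str, int]:
    # Positive delta: the with-skill response hit more rubric markers.
    return {k: score(with_skill[k]) - score(baseline[k]) for k in baseline}
```

Per-scenario deltas quantify the change, which feeds directly into the ≥50% improvement criterion above.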
Example Use Cases
- Baseline vs with-skill comparison for an updated retrieval or reasoning skill.
- Assessing a sentiment or policy-compliance skill in new conversations.
- Guardrail testing to ensure anti-rationalization patterns are triggered reliably.
- Reproducibility check across 3 fresh subagent instances for the same scenario.
- Cross-scenario consistency of skill effects after a bug fix.
Related Skills
precommit-setup
athola/claude-night-market
Configure three-layer pre-commit system with linting, type checking, and testing hooks. Use for quality gate setup and code standards. Skip if pre-commit is optimally configured.
ab-test-setup
coreyhaines31/marketingskills
When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," or "how long should I run this test." Use this whenever someone is comparing two approaches and wants to measure which performs better. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
Playwright Browser Automation
jpulido240-svg/playwright-skill
Complete browser automation with Playwright. Auto-detects dev servers, writes clean test scripts to /tmp. Test pages, fill forms, take screenshots, check responsive design, validate UX, test login flows, check links, automate any browser task. Use when user wants to test websites, automate browser interactions, validate web functionality, or perform any browser-based testing.
python-testing
athola/claude-night-market
Consult this skill for Python testing implementation and patterns. Use
workflow-setup
athola/claude-night-market
Configure GitHub Actions CI/CD workflows for automated testing, linting, and deployment. Use for CI/CD setup and quality automation. Skip if CI/CD configured or using different platform.
RubyCritic Code Quality Analysis
esparkman/claude-rubycritic-skill
Analyze Ruby and Rails code quality with RubyCritic. Identifies code smells, complexity issues, and refactoring opportunities. Provides detailed metrics, scores files A-F, compares branches, and prioritizes high-churn problem areas. Use when analyzing Ruby code quality, reviewing PRs, or identifying technical debt.