
subagent-testing

npx machina-cli add skill athola/claude-night-market/subagent-testing --openclaw
Files (1)
SKILL.md
3.0 KB

Subagent Testing - TDD for Skills

Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.

Table of Contents

  1. Overview
  2. Why Fresh Instances Matter
  3. Testing Methodology
  4. Quick Start
  5. Detailed Testing Guide
  6. Success Criteria

Overview

Fresh instances prevent priming: each test runs in a new Claude conversation so that what you measure is the skill's impact, not effects of conversation history.

Why Fresh Instances Matter

The Priming Problem

Running tests in the same conversation creates bias:

  • Prior context influences responses
  • Skill effects get mixed with conversation history
  • Can't isolate skill's true impact

Fresh Instance Benefits

  • Isolation: Each test starts clean
  • Reproducibility: Consistent baseline state
  • Measurement: Clear before/after comparison
  • Validation: Proves skill effectiveness, not priming

Testing Methodology

Three-phase TDD-style approach:

Phase 1: Baseline Testing (RED)

Test without skill to establish baseline behavior.

Phase 2: With-Skill Testing (GREEN)

Test with skill loaded to measure improvements.

Phase 3: Rationalization Testing (REFACTOR)

Test skill's anti-rationalization guardrails.

Quick Start

# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses

# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline

# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work
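The three steps above can be sketched as a small harness. This is only an illustration: `run_fresh_instance` is a hypothetical stand-in (here simulated with a stub) for whatever mechanism your agent runtime provides to start a clean conversation, and the citation-marker metric is a placeholder.

```python
# Sketch of the baseline vs with-skill comparison.
# `run_fresh_instance` is a hypothetical stub: a real version would
# spawn a new subagent conversation with no prior history.

def run_fresh_instance(prompt, skill=None):
    # Simulated behavior: the (hypothetical) skill makes the
    # response cite its sources.
    response = f"answer to: {prompt}"
    if skill == "subagent-testing":
        response += " [sources cited]"
    return response

def score(response):
    # Placeholder skill-related metric: does the response cite sources?
    return 1 if "[sources cited]" in response else 0

scenarios = [f"scenario {i}" for i in range(5)]  # 5 diverse prompts

# Phase 1 (RED): baseline runs, no skill loaded
baseline = [score(run_fresh_instance(p)) for p in scenarios]

# Phase 2 (GREEN): identical prompts, skill loaded, fresh instances
with_skill = [score(run_fresh_instance(p, skill="subagent-testing"))
              for p in scenarios]

print(sum(baseline), sum(with_skill))  # compare totals before/after
```

In a real harness, each `run_fresh_instance` call would record the full response for later comparison, per the methodology above.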

Detailed Testing Guide

For complete testing patterns, examples, and templates, see the skill's SKILL.md.

Success Criteria

  • Baseline: Document 5+ diverse baseline scenarios
  • Improvement: ≥50% improvement in skill-related metrics
  • Consistency: Results reproducible across fresh instances
  • Rationalization Defense: Guardrails prevent ≥80% of rationalization attempts
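The criteria above can be checked mechanically. A minimal sketch, assuming the ≥50% and ≥80% thresholds from the list; the metric arguments are placeholders for whatever skill-related measure you track:

```python
def meets_success_criteria(baseline_score, with_skill_score,
                           rationalization_attempts, blocked_attempts,
                           n_baseline_scenarios):
    """Check the success criteria: 5+ baseline scenarios, >=50%
    improvement over baseline, and >=80% rationalization defense.
    Scores are any comparable skill-related metric (higher is better)."""
    enough_scenarios = n_baseline_scenarios >= 5
    # Relative improvement; treat a zero baseline as trivially improved
    improvement = ((with_skill_score - baseline_score) / baseline_score
                   if baseline_score else float("inf"))
    improved = improvement >= 0.5
    # Fraction of rationalization attempts the guardrails blocked
    defense = (blocked_attempts / rationalization_attempts
               if rationalization_attempts else 1.0)
    defended = defense >= 0.8
    return enough_scenarios and improved and defended

# 0.4 -> 0.7 is a 75% improvement; 9/10 attempts blocked is 90% defense
print(meets_success_criteria(0.4, 0.7, 10, 9, 5))
```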

See Also

  • skill-authoring: Creating effective skills
  • bulletproof-skill: Anti-rationalization patterns
  • test-skill: Automated skill testing command

Source

git clone https://github.com/athola/claude-night-market
# Skill file: plugins/abstract/skills/subagent-testing/SKILL.md

Overview

Subagent Testing applies a three-phase TDD-style approach using fresh subagent instances to isolate skill impact from conversation history. It helps validate skill improvements, prevent priming bias, and quantify effectiveness across isolated conversations.

How This Skill Works

Tests run in isolated Claude conversations: first establish a baseline without the skill, then test with the skill loaded, and finally run rationalization tests to exercise the skill's anti-rationalization guardrails. Identical prompts and fresh instances ensure reproducible measurements and clear before/after comparisons, guided by defined success criteria.

When to Use It

  • Validating skill improvements after updates
  • Measuring a skill's impact on behavior in isolation
  • Preventing priming bias in skill validation
  • Reproducing results across fresh conversations
  • Testing anti-rationalization guardrails and guard consistency

Quick Start

  1. Step 1: Create baseline tests (without skill) using 5 diverse scenarios
  2. Step 2: Create with-skill tests (fresh instances); load skill explicitly; use identical prompts; compare to baseline
  3. Step 3: Create rationalization tests; verify anti-rationalization guardrails work

Best Practices

  • Use fresh subagent instances for every test to avoid context bleed.
  • Document 5+ diverse baseline scenarios before loading the skill.
  • Load the skill explicitly during with-skill tests and keep prompts identical.
  • Record and compare full responses to baseline to quantify changes.
  • Define explicit success criteria (e.g., ≥50% improvement, ≥80% defense).
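One way to act on the consistency practice above is to repeat the same scenario across several fresh instances and check that the scores agree. A sketch with placeholder scores and an assumed tolerance (the 0.05 spread is an illustrative choice, not part of the skill):

```python
from statistics import pstdev

def is_reproducible(scores, tolerance=0.05):
    """Treat a scenario as reproducible if scores from repeated
    fresh-instance runs stay within a small spread."""
    return pstdev(scores) <= tolerance

# Scores for one scenario from 3 fresh subagent instances (placeholders)
runs = [0.82, 0.80, 0.81]
print(is_reproducible(runs))
```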

Example Use Cases

  • Baseline vs with-skill comparison for an updated retrieval or reasoning skill.
  • Assessing a sentiment or policy-compliance skill in new conversations.
  • Guardrail testing to ensure anti-rationalization patterns are triggered reliably.
  • Reproducibility check across 3 fresh subagent instances for the same scenario.
  • Cross-scenario consistency of skill effects after a bug fix.



Related Skills

precommit-setup

athola/claude-night-market

Configure three-layer pre-commit system with linting, type checking, and testing hooks. Use for quality gate setup and code standards. Skip if pre-commit is optimally configured.

ab-test-setup

coreyhaines31/marketingskills

When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," or "how long should I run this test." Use this whenever someone is comparing two approaches and wants to measure which performs better. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.

Playwright Browser Automation

jpulido240-svg/playwright-skill

Complete browser automation with Playwright. Auto-detects dev servers, writes clean test scripts to /tmp. Test pages, fill forms, take screenshots, check responsive design, validate UX, test login flows, check links, automate any browser task. Use when user wants to test websites, automate browser interactions, validate web functionality, or perform any browser-based testing.

python-testing

athola/claude-night-market

Consult this skill for Python testing implementation and patterns.

workflow-setup

athola/claude-night-market

Configure GitHub Actions CI/CD workflows for automated testing, linting, and deployment. Use for CI/CD setup and quality automation. Skip if CI/CD configured or using different platform.

RubyCritic Code Quality Analysis

esparkman/claude-rubycritic-skill

Analyze Ruby and Rails code quality with RubyCritic. Identifies code smells, complexity issues, and refactoring opportunities. Provides detailed metrics, scores files A-F, compares branches, and prioritizes high-churn problem areas. Use when analyzing Ruby code quality, reviewing PRs, or identifying technical debt.
