Plugin Testing Standards

Install:
npx machina-cli add skill LiorCohen/sdd/plugin-testing-standards --openclaw


A testing methodology for Claude Code plugins that ensures deterministic verification of LLM-driven workflows.


Core Principles

1. Separation of Concerns

tests/
├── lib/                    # All helper/utility code (wraps Node.js)
│   ├── index.ts            # Re-exports everything
│   ├── paths.ts            # Directory constants
│   ├── fs.ts               # File system operations
│   ├── process.ts          # Command execution
│   ├── claude.ts           # Claude CLI helpers
│   └── http.ts             # HTTP utilities
└── tests/                  # Test files (NO direct node:* imports)
    ├── unit/               # No LLM required
    ├── workflows/          # LLM with deterministic verification
    └── integration/        # Full functional verification

2. No Direct Node.js Imports in Tests

Test files must NOT import from node:* directly. All Node.js functionality is accessed through lib/ helpers.

// BAD - direct node import
import * as fs from 'node:fs';
import * as path from 'node:path';

// GOOD - use lib helpers
import { readFile, joinPath, fileExists } from '../lib/index.js';
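
A lib helper is a thin, typed wrapper over the Node.js API. A minimal sketch of what lib/fs.ts and the lib/index.ts re-export might look like, matching the helper names in the GOOD import above (the implementations are illustrative):

// lib/fs.ts - the only layer allowed to import node:* directly
import { promises as fs } from 'node:fs';
import * as path from 'node:path';

// Read a file as UTF-8 text.
export const readFile = (filePath: string): Promise<string> =>
  fs.readFile(filePath, 'utf-8');

// Check that a file or directory exists without throwing.
export const fileExists = async (filePath: string): Promise<boolean> => {
  try {
    await fs.access(filePath);
    return true;
  } catch {
    return false;
  }
};

// Join path segments using the platform separator.
export const joinPath = (...segments: string[]): string => path.join(...segments);

// lib/index.ts - single entry point for tests
export * from './fs.js';
export * from './paths.js';
export * from './process.js';
export * from './claude.js';
export * from './http.js';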

3. File Size Limit: 300 Lines

If a test file exceeds 300 lines, split it into a directory with multiple smaller files.

tests/unit/large-feature.test.ts  (350 lines - TOO BIG)

# Split into:
tests/unit/large-feature/
├── core.test.ts           (~100 lines)
├── validation.test.ts     (~120 lines)
└── integration.test.ts    (~130 lines)

4. WHY Comments on Every Test

Every describe and it block must have a WHY comment explaining the business/technical value, not the mechanics.

/**
 * WHY: Ensures scaffolding substitutes project name variables.
 * Without this, generated projects have {{PROJECT_NAME}} literals
 * in package.json, breaking npm install.
 */
it('substitutes {{PROJECT_NAME}} in templates', async () => { ... });

Test Tiers

Tier | Name        | LLM | Purpose                                    | Duration
-----|-------------|-----|--------------------------------------------|---------
1    | Unit        | No  | Test pure functions, templates, structure  | < 10s
2    | Workflow    | Yes | Verify correct agent/skill invocations     | < 15min
3    | Integration | Yes | Verify generated output actually works     | < 20min

Tier 1: Unit Tests

Pure TypeScript tests with no Claude involved; a minimal example follows the list.

  • Test scaffolding scripts directly
  • Validate template files exist and have correct structure
  • Validate plugin structure (commands, agents, skills)
  • No network calls, no LLM
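
A sketch of a Tier 1 structure test, assuming vitest and the hypothetical lib helpers fileExists, joinPath, and a PLUGIN_ROOT constant from lib/paths.ts (paths reflect the usual Claude Code plugin layout):

import { describe, it, expect } from 'vitest';
import { fileExists, joinPath, PLUGIN_ROOT } from '../../lib/index.js';

/**
 * WHY: Claude Code only loads the plugin if its manifest and command
 * files are where the loader expects them. Catching a misplaced file
 * here is far cheaper than debugging a workflow test that never runs.
 */
describe('plugin structure', () => {
  it('has a plugin manifest', async () => {
    expect(await fileExists(joinPath(PLUGIN_ROOT, '.claude-plugin', 'plugin.json'))).toBe(true);
  });

  it('has a commands directory', async () => {
    // fileExists also accepts directories (it wraps fs.access)
    expect(await fileExists(joinPath(PLUGIN_ROOT, 'commands'))).toBe(true);
  });
});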

Tier 2: Workflow Tests

Run Claude with predefined inputs, parse output, verify invocations.

  • Capture stream-json output
  • Parse tool/skill/agent invocations
  • Compare to expected behavior
  • Deterministic pass/fail

Tier 3: Integration Tests

Verify generated output actually works.

  • Run npm install on generated projects
  • Run npm run build to verify TypeScript compiles
  • Start servers and verify they respond
  • Tests expose real issues in scaffolding/templates

Deterministic LLM Testing

Approach

  1. Run Claude in non-interactive mode with predefined inputs (helper sketched below)
  2. Capture structured output via --output-format stream-json
  3. Parse tool/skill/agent invocations from JSON
  4. Compare to expected behavior defined in test specs
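
A sketch of the runClaude helper that lib/claude.ts might expose. Beyond --output-format stream-json, which this skill names explicitly, the -p (non-interactive print) and --verbose flags are assumptions about the claude CLI:

// lib/claude.ts - lib code may import node:*; tests call this instead
import { spawn } from 'node:child_process';

export interface ClaudeResult {
  readonly exitCode: number;
  readonly stdout: string;
  readonly stderr: string;
}

export const runClaude = (prompt: string, cwd: string, timeoutSec: number): Promise<ClaudeResult> =>
  new Promise((resolve, reject) => {
    // stream-json in print mode may also require --verbose; check your CLI version
    const child = spawn('claude', ['-p', prompt, '--output-format', 'stream-json', '--verbose'], { cwd });
    let stdout = '';
    let stderr = '';
    child.stdout.on('data', (chunk) => { stdout += chunk; });
    child.stderr.on('data', (chunk) => { stderr += chunk; });
    // Kill the session if it runs past the tier's time budget
    const timer = setTimeout(() => child.kill('SIGKILL'), timeoutSec * 1000);
    child.on('error', reject);
    child.on('close', (code) => {
      clearTimeout(timer);
      resolve({ exitCode: code ?? 1, stdout, stderr });
    });
  });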

Stream-JSON Output Structure

{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Skill","input":{"skill":"init"}}]}}
{"type":"assistant","message":{"content":[{"type":"tool_use","name":"Task","input":{"subagent_type":"spec-writer"}}]}}

Parsing Helper

interface ToolUse {
  readonly name: string;
  readonly input: Record<string, any>;
  readonly id: string;
}

interface ParsedOutput {
  readonly toolUses: readonly ToolUse[];
  readonly skillInvocations: readonly string[];
  readonly agentInvocations: readonly string[];
}

const parseClaudeOutput = (output: string): ParsedOutput => {
  const toolUses: ToolUse[] = [];
  const skillInvocations: string[] = [];
  const agentInvocations: string[] = [];

  for (const line of output.split('\n')) {
    try {
      const event = JSON.parse(line);
      if (event.type === 'assistant' && event.message?.content) {
        for (const content of event.message.content) {
          if (content.type === 'tool_use') {
            toolUses.push({ name: content.name, input: content.input, id: content.id });
            if (content.name === 'Skill') skillInvocations.push(content.input.skill);
            if (content.name === 'Task') agentInvocations.push(content.input.subagent_type);
          }
        }
      }
    } catch { /* skip non-JSON */ }
  }

  return { toolUses, skillInvocations, agentInvocations };
};
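
Putting the helpers together, a workflow test compares parsed invocations to the spec. runClaude is the helper sketched earlier; the prompt, directory, and expected names are illustrative:

import { describe, it, expect } from 'vitest';
import { runClaude, parseClaudeOutput } from '../../lib/index.js';

// Illustrative fixtures; real tests would build these via lib helpers
const INIT_PROMPT = 'Run /init for a project named demo-app. THIS IS AN AUTOMATED TEST...';
const testDir = '/tmp/init-workflow-test';

/**
 * WHY: The init command must delegate to the init skill and the
 * spec-writer agent. If either invocation disappears, downstream
 * scaffolding silently stops happening.
 */
describe('init workflow', () => {
  it('invokes the init skill and spec-writer agent', async () => {
    const result = await runClaude(INIT_PROMPT, testDir, 600);
    expect(result.exitCode).toBe(0);

    const parsed = parseClaudeOutput(result.stdout);
    expect(parsed.skillInvocations).toContain('init');
    expect(parsed.agentInvocations).toContain('spec-writer');
  });
});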

Prompt Engineering for Determinism

Include these instructions in all automated test prompts:

THIS IS AN AUTOMATED TEST. You MUST:
1. Skip ALL discovery questions and use the values above
2. Skip approval steps - consider it PRE-APPROVED
3. Execute ALL steps through completion
4. Do NOT stop for user input at any point
5. Create ALL files in the CURRENT WORKING DIRECTORY (.) - do NOT use absolute paths
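
One way to keep this preamble identical across tests is to centralize it in a lib helper; a sketch (buildTestPrompt is a hypothetical name):

// lib/prompts.ts - appends the determinism preamble to every automated test prompt
const DETERMINISM_PREAMBLE = `THIS IS AN AUTOMATED TEST. You MUST:
1. Skip ALL discovery questions and use the values above
2. Skip approval steps - consider it PRE-APPROVED
3. Execute ALL steps through completion
4. Do NOT stop for user input at any point
5. Create ALL files in the CURRENT WORKING DIRECTORY (.) - do NOT use absolute paths`;

export const buildTestPrompt = (taskDescription: string): string =>
  `${taskDescription}\n\n${DETERMINISM_PREAMBLE}`;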

Integration Test Pattern

/**
 * WHY: Verifies that init generates projects that actually compile.
 * Catches issues like invalid TypeScript, missing dependencies, or
 * broken import paths that would break users immediately.
 */
describe('init functional verification', () => {
  /**
   * WHY: npm install must succeed for users to run the project.
   * Catches invalid package.json, missing dependencies, or
   * dependency version conflicts.
   */
  it('generated project installs dependencies', async () => {
    const result = await runClaude(PROMPT, testDir, 300);
    expect(result.exitCode).toBe(0);

    const installResult = await runCommand('npm', ['install'], { cwd: projectDir });
    expect(installResult.exitCode).toBe(0);
  });

  /**
   * WHY: TypeScript must compile for the project to be usable.
   * Catches type errors, missing type definitions, or invalid
   * tsconfig settings in templates.
   */
  it('generated project builds successfully', async () => {
    const buildResult = await runCommand('npm', ['run', 'build'], { cwd: serverDir });
    expect(buildResult.exitCode).toBe(0);
  });
});

Critical Principle: Fix Code, Not Tests

If an integration test fails:

  • The fix belongs in scaffolding.ts or templates
  • NOT in test assertions or expectations
  • Tests exist to catch real issues in generated output
# BAD: Weakening test to pass
- expect(buildResult.exitCode).toBe(0);
+ expect(buildResult.exitCode).toBeLessThan(2); // "allow warnings"

# GOOD: Fix the actual issue
# Edit scaffolding.ts or template files to fix the build error

Directory Structure Template

tests/{plugin-name}/
├── src/
│   ├── lib/
│   │   ├── index.ts
│   │   ├── paths.ts
│   │   ├── fs.ts
│   │   ├── process.ts
│   │   ├── claude.ts
│   │   └── http.ts
│   └── tests/
│       ├── unit/
│       │   └── {feature}/
│       │       └── {concern}.test.ts
│       ├── workflows/
│       │   ├── {command}.test.ts
│       │   └── {skill}/
│       │       └── {scenario}.test.ts
│       └── integration/
│           └── {command}-functional.test.ts
├── package.json
├── tsconfig.json
└── vitest.config.ts
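
A minimal vitest.config.ts for this layout; the timeout values are illustrative, sized to the tier budgets above:

import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    include: ['src/tests/**/*.test.ts'],
    // LLM-backed tiers run real Claude sessions, so the per-test
    // timeout matches the Tier 3 budget; unit tests finish in seconds.
    testTimeout: 20 * 60 * 1000,
    hookTimeout: 60 * 1000,
  },
});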

NPM Scripts

{
  "scripts": {
    "test": "vitest run",
    "test:unit": "vitest run src/tests/unit/",
    "test:workflows": "vitest run src/tests/workflows/",
    "test:integration": "vitest run src/tests/integration/",
    "test:ci": "vitest run src/tests/unit/",
    "test:all": "vitest run"
  }
}

Success Criteria

  1. Unit tests complete in < 10 seconds (no LLM)
  2. Workflow tests are deterministic - same input produces same pass/fail
  3. Integration tests verify generated projects actually build and run
  4. Failures clearly identify what invocation was missing or wrong
  5. Test failures indicate issues in plugin code, NOT in tests
  6. All test files are < 300 lines
  7. All test blocks have WHY comments

Source

https://github.com/LiorCohen/sdd/blob/main/.claude/skills/plugin-testing-standards/SKILL.md

Overview

Defines a deterministic testing workflow for Claude Code plugins, codifying project structure, test boundaries, and proven methods to verify LLM-driven workflows. It enforces separation of concerns, avoids direct Node.js imports in tests, and uses stream-json output to compare actual invocations against expectations.

How This Skill Works

Tests live under tests/ with a lib/ helper layer and tiered test files (unit, workflow, integration). Claude runs in non-interactive mode with predefined inputs and --output-format stream-json; the emitted JSON is parsed to verify tool/skill/agent invocations against the spec.

When to Use It

  • Validate scaffolding and plugin structure
  • Verify deterministic LLM invocations in workflows
  • Enforce test size limit and split large files
  • Document WHY for every test to capture business value
  • Run end-to-end integration checks of generated projects

Quick Start

  1. Create the tests/ structure with lib/ and unit/workflow/integration tests
  2. Run Claude in non-interactive mode with predefined inputs and --output-format stream-json
  3. Parse the JSON output and compare against expected invocations/specs

Best Practices

  • Use lib/ helpers for all Node.js functionality
  • Avoid direct node:* imports in tests
  • Keep test files under 300 lines; split large files
  • Add WHY comments to every describe/it describing business value
  • Validate plugin structure (commands, agents, skills) and templates in unit tests

Example Use Cases

  • Unit tests validating plugin scaffolding and template structure
  • Workflow tests capturing and validating stream-json tool invocations
  • Integration tests building and running a generated project
  • GOOD vs BAD test patterns: using lib/ vs direct node imports
  • Example of splitting a large test into multiple files when >300 lines
