
shipyard-testing

npx machina-cli add skill lgbarn/shipyard/shipyard-testing --openclaw
<!-- TOKEN BUDGET: 400 lines / ~1200 tokens -->

Writing Effective Tests

<activation>

When This Skill Activates

  • Writing or modifying test files (*.test.*, *.spec.*, *_test.go, test_*.py)
  • Setting up test infrastructure or test utilities
  • Debugging flaky, brittle, or slow tests
  • Deciding between unit, integration, and E2E tests
  • Choosing when and how to use mocks, stubs, or fakes

Natural Language Triggers

  • "add tests", "test coverage", "testing strategy", "write tests for", "how should I test"
</activation>

Overview

Test behaviors through public APIs. Verify state, not interactions.

Core principle: If refactoring breaks your tests but not your users, your tests are wrong.

Relationship to TDD: The shipyard:shipyard-tdd skill covers WHEN to write tests (test-first, red-green-refactor). This skill covers HOW to write tests that are effective, maintainable, and trustworthy.

The Iron Law

TEST BEHAVIORS, NOT IMPLEMENTATIONS

A test should break only when the system's observable behavior changes -- never because of refactoring, renaming internals, or restructuring code.

No exceptions:

  • Don't test private methods
  • Don't assert on internal state
  • Don't verify method call sequences
  • Don't couple tests to data structures users never see
<instructions>

Test Structure (AAA)

Every test follows Arrange-Act-Assert:

Arrange -- Set up preconditions and inputs
Act     -- Execute the behavior under test
Assert  -- Verify the expected outcome

Separate the three sections with blank lines. One Act per test. One logical assertion per test.

<examples>
<example type="good" title="Clear AAA structure, one behavior per test">
```python
def test_expired_subscription_denies_access():
    # Arrange
    user = create_user(subscription_end=yesterday())

    # Act
    result = check_access(user, resource="premium-content")

    # Assert
    assert result.denied is True
    assert result.reason == "subscription expired"
```
Clear name, tests one behavior, obvious structure.
</example>

<example type="bad" title="Multiple behaviors, internal state tested">
```python
def test_subscription():
    user = create_user(subscription_end=yesterday())
    assert check_access(user, "premium-content").denied
    user.subscription_end = tomorrow()
    assert check_access(user, "premium-content").allowed
    assert user.access_log == [("denied", "premium-content"), ("allowed", "premium-content")]

Two behaviors in one test, tests internal log, vague name. </example>

</examples>

Keep Tests DAMP, Not DRY

Prefer Descriptive And Meaningful Phrases over eliminating duplication. Duplicating setup across tests is fine if it makes each test self-contained and readable. Extract shared setup into helpers only when it improves clarity, not to reduce line count.
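
As a sketch, reusing the hypothetical create_user and check_access helpers from the example above, duplicated one-line setup keeps each test readable on its own:

```python
# Each test repeats its own setup; there is no shared helper to chase down.
def test_active_subscription_grants_access():
    user = create_user(subscription_end=tomorrow())

    result = check_access(user, resource="premium-content")

    assert result.allowed is True


def test_expired_subscription_denies_access():
    user = create_user(subscription_end=yesterday())

    result = check_access(user, resource="premium-content")

    assert result.denied is True
```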

No Logic in Tests

Tests are straight-line code. No loops, conditionals, ternaries, or string concatenation. If you need logic, the test is too complex -- split it or simplify the design.
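
For instance (discount_for is a hypothetical function), a loop-driven test hides which case failed and can itself contain bugs; straight-line tests do neither:

```python
# Avoid: logic in the test. A failure only says the loop broke somewhere.
def test_discounts():
    for quantity, expected in [(1, 0), (10, 5), (100, 15)]:
        assert discount_for(quantity) == expected

# Prefer: one straight-line test per behavior, each with an obvious failure.
def test_no_discount_for_a_single_item():
    assert discount_for(1) == 0

def test_five_percent_discount_at_ten_items():
    assert discount_for(10) == 5
```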

What to Test

Test These

  • Behaviors: What the system does from a user's perspective
  • Edge cases: Empty inputs, boundaries, overflow, zero, null
  • Error paths: Invalid input, missing dependencies, timeouts
  • State transitions: Before and after an operation
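
For example, edge-case and error-path tests for a hypothetical parse_amount(text) function, written pytest-style:

```python
import pytest  # assumes a pytest-style runner

def test_parses_zero():
    # Boundary value: zero is valid input, not an error
    assert parse_amount("0") == 0

def test_rejects_empty_input():
    # Error path: invalid input raises rather than returning garbage
    with pytest.raises(ValueError):
        parse_amount("")
```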

Skip These

  • Trivial getters/setters with no logic
  • Generated code (protobuf, ORM migrations)
  • Configuration files
  • Third-party library internals

Test Via Public APIs

Invoke the system the same way its callers do. If the only way to test something is through a private method, the design needs to change -- extract it into a collaborator with its own public interface.
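
A sketch of that refactor with hypothetical names: scoring logic that once lived in a private method becomes a collaborator with its own public interface, testable directly:

```python
from dataclasses import dataclass

@dataclass
class Order:
    total: int

# Was OrderService._score, reachable only through a private method.
# Now a collaborator with a public API that tests can call directly.
class RiskScorer:
    def score(self, order: Order) -> int:
        return order.total // 100  # illustrative rule, not from the source

def test_scores_one_point_per_hundred_of_total():
    assert RiskScorer().score(Order(total=250)) == 2
```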

Naming Tests

Name tests after the behavior, not the method.

| Good | Bad |
|------|-----|
| rejects_empty_email_with_validation_error | test_validate |
| returns_cached_result_when_within_ttl | test_cache |
| retries_three_times_before_failing | test_retry_logic |
| grants_access_when_subscription_active | test_check_access_method |

Patterns that work:

  • [action]_[condition]_[expected_result]
  • should [expected behavior] when [condition]
  • Plain descriptive sentence

A test name containing "and" usually means two tests.
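
A quick sketch of that split, with hypothetical names:

```python
# Smell: "and" in the name means two behaviors in one test.
def test_login_succeeds_and_writes_audit_entry(): ...

# Better: one behavior each, one useful failure message each.
def test_login_succeeds_with_valid_credentials(): ...
def test_successful_login_writes_audit_entry(): ...
```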

Choosing Test Level

```dot
digraph test_level {
    rankdir=TB;
    start [label="What are you testing?", shape=diamond];
    pure [label="Pure logic?\nNo I/O, no side effects", shape=diamond];
    boundary [label="System boundary?\nDB, API, file, queue", shape=diamond];
    journey [label="Critical user journey?\nLogin, checkout, signup", shape=diamond];

    unit [label="UNIT TEST\nFast, isolated, many", shape=box, style=filled, fillcolor="#ccffcc"];
    integration [label="INTEGRATION TEST\nBoundary focused, fewer", shape=box, style=filled, fillcolor="#ffffcc"];
    e2e [label="E2E TEST\nFull stack, minimal", shape=box, style=filled, fillcolor="#ffcccc"];

    start -> pure;
    pure -> unit [label="yes"];
    pure -> boundary [label="no"];
    boundary -> integration [label="yes"];
    boundary -> journey [label="no"];
    journey -> e2e [label="yes"];
    journey -> unit [label="no\n(rethink)"];
}
```

The pyramid: Many unit tests. Fewer integration tests. Minimal E2E tests.

Push tests down. If a behavior can be tested at a lower level, test it there. Lower = faster, more stable, cheaper to maintain. Only go higher when the lower level can't verify the behavior (serialization, wiring, full user flow).

Duplicate coverage? If a unit test and integration test verify the same behavior, keep the unit test. Remove the integration test unless it adds confidence about wiring.
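
As an example of a boundary-focused integration test, the sketch below runs real SQL against an in-memory SQLite database (sqlite3 ships with Python), verifying wiring without an external service:

```python
import sqlite3

def test_saved_user_can_be_looked_up_by_email():
    # Arrange: a real database engine, created fresh in memory
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?)", ("ada@example.com", "Ada"))

    # Act
    row = conn.execute(
        "SELECT name FROM users WHERE email = ?", ("ada@example.com",)
    ).fetchone()

    # Assert
    assert row == ("Ada",)
```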

Test Isolation and Doubles

When to Use Test Doubles

| Use doubles for | Use real objects for |
|---|---|
| Network calls (HTTP, gRPC) | In-process collaborators |
| Database queries (in unit tests) | Value objects and data structures |
| Filesystem I/O | Pure functions |
| Non-deterministic sources (time, random) | Deterministic transformations |
| Expensive operations (ML inference, rendering) | Lightweight dependencies |

Types of Doubles

  • Stub: Returns canned responses. Use when you need a dependency to return specific data.
  • Fake: Working implementation with shortcuts (in-memory DB, local file store). Preferred over mocks for complex interactions.
  • Mock: Records calls and asserts on them. Use sparingly -- only when the interaction IS the behavior (e.g., "sends an email").
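
A minimal fake, sketched for a hypothetical user-store interface, shows why fakes stay pleasant for complex interactions: the test reads like real usage.

```python
class InMemoryUserStore:
    """Fake: real behavior, with an in-memory dict standing in for a database."""

    def __init__(self):
        self._users = {}

    def save(self, user_id, user):
        self._users[user_id] = user

    def get(self, user_id):
        return self._users.get(user_id)

def test_saved_user_is_retrievable():
    store = InMemoryUserStore()

    store.save("u1", {"name": "Ada"})

    assert store.get("u1") == {"name": "Ada"}
```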

The Mocking Rule

Verify STATE over INTERACTIONS.
Mock only when the side effect IS the behavior under test.
<examples>
<example type="good" title="Testing resulting state">
```typescript
// Testing that withdrawal updates balance (state)
test('withdrawal reduces account balance', () => {
  const account = new Account(100);

  account.withdraw(30);

  expect(account.balance).toBe(70);
});
```
Tests resulting state -- survives internal refactoring.
</example>

<example type="bad" title="Testing internal interaction">
```typescript
// Testing that withdrawal calls the right methods (interaction)
test('withdrawal calls debit', () => {
  const ledger = mock(Ledger);
  const account = new Account(100, ledger);

  account.withdraw(30);

  expect(ledger.debit).toHaveBeenCalledWith(30);
});

Tests internal interaction -- breaks if refactored. </example>

</examples>

Unchanging Tests

A well-written test changes only when requirements change. It survives:

  • Refactoring -- internal restructuring, no behavior change
  • Bug fixes -- new tests added, existing tests unchanged
  • New features -- unrelated tests unaffected

If your tests break during refactoring, they're testing implementation details. Fix the tests to test behavior instead.

Clear Failure Messages

A failure message should let you diagnose the problem without reading the test code.

<examples>
<example type="good" title="Descriptive failure message">
```
FAIL: rejects_empty_email_with_validation_error
Expected: ValidationError("email is required")
Actual:   None (no error raised)
```
Shows expected vs actual with context.
</example>
<example type="bad" title="Opaque failure message">
```
FAIL: test_validate
AssertionError: False is not True
```
No context, unclear what was tested.
</example>
</examples>

Use assertion libraries that produce descriptive messages. Prefer assertEqual(expected, actual) over bare assert.
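
For instance, in Python's unittest (validate_email is a hypothetical validator), assertEqual reports both values on failure where a bare assert reports almost nothing:

```python
import unittest

class TestEmailValidation(unittest.TestCase):
    def test_rejects_empty_email_with_validation_error(self):
        result = validate_email("")  # hypothetical validator

        # On failure, assertEqual prints expected vs actual automatically.
        self.assertEqual("email is required", result.error)
```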

</instructions> <rules>

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Testing implementation | Breaks on refactor | Test observable behavior via public API |
| Over-mocking | Tests verify mocks, not code | Use real objects; mock only at boundaries |
| Flaky tests | Non-deterministic (time, order, network) | Control inputs; isolate state; use fakes |
| Ice cream cone | Too many E2E, few unit tests | Push tests down the pyramid |
| Logic in tests | Bugs in test code hide bugs in prod | Straight-line code only |
| Shared mutable state | Tests depend on run order | Fresh state per test (setup/teardown) |
| Test-only production code | Methods exist only for testing | Test via public API; redesign if needed |
| Assertion-free tests | Test runs but verifies nothing | Every test must assert expected outcome |
| Giant setup | 50 lines before the act | Simplify design; extract builder/factory |
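
For the shared-mutable-state row above, a pytest-style fixture is one fix: each test gets a fresh object, so run order stops mattering.

```python
import pytest

@pytest.fixture
def cart():
    return []  # built fresh for every test; nothing leaks between tests

def test_new_cart_is_empty(cart):
    assert cart == []

def test_added_item_is_visible_only_in_this_test(cart):
    cart.append("book")
    assert cart == ["book"]
```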

Red Flags -- STOP and Rethink

  • Mocking more than 2 dependencies -- class has too many responsibilities
  • Test setup exceeds 20 lines -- design is too complex or test scope too broad
  • Test name contains "and" -- split into two tests
  • Tests break when you refactor -- testing implementation, not behavior
  • Tests pass but bugs ship -- testing the wrong things
  • Can't explain what behavior a test verifies -- rewrite or delete it
  • All tests use mocks -- not testing real behavior
  • Test file is longer than production file -- over-testing or testing internals

Common Rationalizations

| Excuse | Reality |
|---|---|
| "Need mocks for everything" | Design is too coupled. Inject dependencies, use fakes. |
| "100% coverage = quality" | Coverage measures execution, not correctness. 80% covering behaviors > 100% covering lines. |
| "E2E tests cover everything" | Slow, flaky, expensive. Unit tests catch most bugs faster. |
| "Test is too hard to write" | Hard to test = hard to use. Fix the design. |
| "It's just test code" | Test code deserves production standards. Sloppy tests rot fast. |
| "One big test is easier" | One big test = one useless failure message. Split by behavior. |
| "Helper makes it DRY" | If the helper obscures what's tested, inline it. DAMP > DRY. |
</rules>

Quick Reference: FIRST Principles

  • Fast -- Tests run in seconds, not minutes. Slow tests get skipped.
  • Independent -- No test depends on another. Any order, any subset.
  • Repeatable -- Same result every run, any machine, any time.
  • Self-checking -- Pass/fail without human inspection.
  • Timely -- Written close to the code. Not "later." Not "before release."

Final Rule

Test what your users care about, through the interfaces they use.
Everything else is an implementation detail.

Source

https://github.com/lgbarn/shipyard/blob/main/skills/shipyard-testing/SKILL.md

