qa-run is an 8-agent QA loop that starts with live browser exploration using Playwright MCP and proceeds through analysis, planning, testing, auditing, healing, expanding, and snapshot steps to build robust test coverage.

What does the 85 quality gate mean?

The feature must achieve a quality gate score of 85 or higher to pass; otherwise, issues are surfaced for remediation and re-testing.

What outputs does qa-run generate?

qa-run produces exploration reports and plans, including qa/browser-tests/$FEATURE/exploration.md and qa/plans/$FEATURE.md, plus initial Playwright test files for refinement.

qa-run

npx machina-cli add skill ajaywadhara/agentic-sdlc-plugin/qa-run --openclaw

Arguments: $FEATURE (or "all" for entire suite)

Read CLAUDE.md before doing anything else. Ensure the dev server is running before proceeding.

━━━ BROWSER AGENT (Playwright MCP — runs first) ━━━

Before writing ANY test file, explore the live application using Playwright MCP. This is non-negotiable for web applications. You MUST see the real app first.

STEP A: NAVIGATE EVERY SCREEN
For each screen related to $FEATURE:
1. browser_navigate to the screen URL
2. browser_snapshot: capture the accessibility tree. (This gives you the REAL selectors, roles, and accessible names. Never guess selectors. Always get them from the live app.)
3. browser_take_screenshot: save to qa/browser-tests/$FEATURE/
4. Compare what you see against docs/SCREENS.md and wireframes/
5. Log any discrepancies immediately

STEP B: TEST EVERY INTERACTION
On each screen:
1. browser_click every button and verify the correct result
2. browser_type into every input and verify it accepts input
3. browser_select_option on every dropdown
4. browser_press_key Tab through the page and verify focus order
5. browser_press_key Enter on focused buttons and verify activation
6. For forms: submit with valid data, empty data, and invalid data

STEP C: TEST THE HAPPY PATH LIVE
Read P0 acceptance criteria from docs/PRD.md. Execute each Given/When/Then by actually doing it in the browser:
- browser_navigate to start
- browser_type / browser_click / browser_select_option to perform actions
- browser_verify_text_visible / browser_verify_element_visible for assertions
- browser_take_screenshot at each step

STEP D: RESPONSIVE CHECK
For the 3 most important screens:
browser_resize width=1440 height=900 → browser_take_screenshot (desktop)
browser_resize width=768 height=1024 → browser_take_screenshot (tablet)
browser_resize width=375 height=812 → browser_take_screenshot (mobile)
Verify: no overflow, no cut-off content, touch targets ≥ 44px on mobile
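The touch-target rule in STEP D can be expressed as a small check. This is a minimal sketch, assuming element sizes have already been measured in CSS pixels; the `MeasuredTarget` shape and all names below are hypothetical and not part of Playwright MCP.

```typescript
// Viewport sizes taken from STEP D; the labels are illustrative.
interface ViewportSpec {
  label: string;
  width: number;
  height: number;
}

const QA_VIEWPORTS: ViewportSpec[] = [
  { label: "desktop", width: 1440, height: 900 },
  { label: "tablet", width: 768, height: 1024 },
  { label: "mobile", width: 375, height: 812 },
];

// Hypothetical record of one interactive element's measured size.
interface MeasuredTarget {
  accessibleName: string;
  width: number;
  height: number;
}

const MIN_TOUCH_TARGET_PX = 44;

// Returns the targets that are smaller than 44px in either dimension.
function undersizedTouchTargets(targets: MeasuredTarget[]): MeasuredTarget[] {
  return targets.filter(
    (t) => t.width < MIN_TOUCH_TARGET_PX || t.height < MIN_TOUCH_TARGET_PX
  );
}
```

Any element returned by `undersizedTouchTargets` on the mobile viewport would be logged as a discrepancy in exploration.md.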

STEP E: HEALTH CHECK
browser_console_messages: flag any JavaScript errors or warnings
browser_network_requests: flag any failed requests (4xx/5xx)
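The two health-check filters can be sketched as follows. The record shapes are assumptions for illustration: Playwright MCP returns text, so a real run would first need to parse that output into something like these objects.

```typescript
// Hypothetical shapes; not the actual Playwright MCP output format.
interface ObservedRequest {
  method: string;
  url: string;
  statusCode: number;
}

interface ObservedConsoleEntry {
  level: "log" | "info" | "warning" | "error";
  text: string;
}

// Failed requests are those with a 4xx or 5xx status code.
function failedRequests(requests: ObservedRequest[]): ObservedRequest[] {
  return requests.filter((r) => r.statusCode >= 400 && r.statusCode <= 599);
}

// Only errors and warnings are flagged; log and info entries are ignored.
function flaggedConsoleEntries(
  entries: ObservedConsoleEntry[]
): ObservedConsoleEntry[] {
  return entries.filter((e) => e.level === "error" || e.level === "warning");
}
```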

STEP F: GENERATE INITIAL TEST FILES
Use browser_generate_playwright_test to create .spec.ts files from your session.
Save to: tests/e2e/$FEATURE-browser.spec.ts
These become the foundation that the Engineer Agent refines below.

Output: qa/browser-tests/$FEATURE/exploration.md (Summary of what was found: working elements, broken elements, missing elements, selectors discovered, accessibility tree findings)

━━━ ANALYST AGENT ━━━

Read the source code for $FEATURE. Read qa/plans/ for any existing test coverage on this feature. Read qa/browser-tests/$FEATURE/exploration.md for live browser findings.

Map every testable surface:

Every user-facing interaction (clicks, inputs, form submissions, keyboard nav)
Every API call this feature makes and its possible response shapes
Every UI state: loading, empty, error, success, partial data
Every data-testid attribute or accessible role present in the DOM
Every validation rule (client-side and server-side)
Every route or navigation this feature triggers

Output: qa/plans/$FEATURE.md

━━━ PLANNER AGENT ━━━

Read qa/plans/$FEATURE.md. Assign priority and write a Given/When/Then for each:

P0: "If this breaks, the product is unusable" (auth flows, data saving, core feature paths)
P1: "If this breaks, a significant feature is degraded" (secondary flows, important edge cases)
P2: "Edge case, good to have covered" (unusual inputs, rare states, nice-to-have validation)

Also include for each screen:

Empty state scenario (user has no data yet)
Error state scenario (network fails, server returns 500)
Mobile viewport scenario (at least for P0 items)

Output: qa/plans/$FEATURE-prioritized.md
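One possible way to model a prioritized scenario before writing it into the plan file. This is a sketch only: the `PlannedScenario` type and the rendered text layout are assumptions, since the skill does not prescribe a format for qa/plans/$FEATURE-prioritized.md.

```typescript
type ScenarioPriority = "P0" | "P1" | "P2";

// Hypothetical in-memory form of one plan entry.
interface PlannedScenario {
  title: string;
  priority: ScenarioPriority;
  given: string;
  when: string;
  then: string;
}

const PRIORITY_RANK: Record<ScenarioPriority, number> = { P0: 0, P1: 1, P2: 2 };

// Returns a sorted copy with P0 scenarios first; the input is left untouched.
function sortByPriority(scenarios: PlannedScenario[]): PlannedScenario[] {
  return scenarios
    .slice()
    .sort((a, b) => PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority]);
}

// Renders one scenario as a priority line followed by Given/When/Then lines.
function formatScenario(s: PlannedScenario): string {
  return [
    `${s.priority}: ${s.title}`,
    `  Given ${s.given}`,
    `  When ${s.when}`,
    `  Then ${s.then}`,
  ].join("\n");
}
```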

━━━ ENGINEER AGENT ━━━

CRITICAL: The dev server must be running. Use Playwright MCP to navigate the actual, running application before writing any test.

For each scenario in qa/plans/$FEATURE-prioritized.md:

Navigate to the relevant route using Playwright MCP
Confirm the element you intend to target is visible and accessible
Note the exact accessible role, label, or testId
Then write the Playwright test

Write all tests to: tests/e2e/$FEATURE.spec.ts

Playwright rules (these are absolute, no exceptions):
ALLOWED: getByRole('button', { name: 'Save' })
ALLOWED: getByLabel('Email address')
ALLOWED: getByText('No transactions yet')
ALLOWED: getByTestId('transaction-list')
FORBIDDEN: page.$('.save-btn')
FORBIDDEN: page.$('#submit')
FORBIDDEN: page.$x('//button[@class="primary"]')
FORBIDDEN: page.waitForTimeout(3000) <- use expect().toBeVisible() instead

Every test must:

Have a descriptive name explaining what it verifies
Assert a specific, meaningful outcome (not just "doesn't crash")
Use proper async/await throughout
Clean up any data it creates (use beforeEach/afterEach hooks)

━━━ SENTINEL AGENT ━━━

Read tests/e2e/$FEATURE.spec.ts line by line.

BLOCK (stop QA loop, return to Engineer) if any of these exist:

Any selector containing "." or "#" or "//"
Any action missing an await keyword
Any test block with zero assertions (expect() calls)
Any page.waitForTimeout() greater than 2000ms
Any test that only navigates and clicks with no assertion

WARN (flag but do not block) for:

Test names that don't clearly describe the scenario
Missing afterEach cleanup for data-creating tests
Tests that could affect each other's state

Output: qa/audits/$FEATURE-audit.md

If blockers found: list exact line numbers. Return to Engineer.
If no blockers: proceed.
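Part of the Sentinel's BLOCK list can be approximated with a line-by-line scan. This is a heuristic sketch, not a real audit: it uses regular expressions instead of parsing the file, it treats `page.locator` as a raw-selector call (an assumption; the rules above only name `page.$` and `page.$x`), and it omits the zero-assertion check, which needs knowledge of test block boundaries.

```typescript
interface AuditFinding {
  line: number; // 1-based line number in the spec file
  rule: string;
}

// Heuristic patterns; all are illustrative assumptions.
const RAW_SELECTOR_CALL = /page\.(?:\$\$|\$x|\$|locator)\(\s*(['"`])(.*?)\1/;
const CSS_OR_XPATH = /[.#]|\/\//;
const WAIT_FOR_TIMEOUT = /waitForTimeout\(\s*(\d+)\s*\)/;
const UNAWAITED_ACTION =
  /^\s*page\.(?:.*\.)?(?:goto|click|fill|press|check|selectOption)\(/;

function findBlockers(specSource: string): AuditFinding[] {
  const findings: AuditFinding[] = [];
  specSource.split("\n").forEach((text, index) => {
    const line = index + 1;
    // Selector strings containing ".", "#" or "//".
    const selector = RAW_SELECTOR_CALL.exec(text);
    if (selector !== null && CSS_OR_XPATH.test(selector[2] || "")) {
      findings.push({ line, rule: "CSS or XPath selector" });
    }
    // Fixed waits longer than 2000ms.
    const wait = WAIT_FOR_TIMEOUT.exec(text);
    if (wait !== null && Number(wait[1]) > 2000) {
      findings.push({ line, rule: "waitForTimeout over 2000ms" });
    }
    // Statements that start with "page." and call an action without await.
    if (UNAWAITED_ACTION.test(text)) {
      findings.push({ line, rule: "action missing await" });
    }
  });
  return findings;
}
```

The returned line numbers map directly onto the "list exact line numbers" requirement for qa/audits/$FEATURE-audit.md.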

━━━ EXECUTION ━━━

Run: npx playwright test tests/e2e/$FEATURE.spec.ts --reporter=json
Save full output to: qa/runs/$FEATURE-latest.json
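To decide whether the Healer needs to run, the saved report can be scanned for failed specs. The sketch below assumes the JSON reporter's layout of nested `suites`, each holding `specs` with a `title` and an `ok` flag; only the fields used here are declared, and the layout should be checked against the actual contents of qa/runs/$FEATURE-latest.json.

```typescript
// Minimal, assumed subset of the JSON report structure.
interface ReportSpec {
  title: string;
  ok: boolean;
}

interface ReportSuite {
  title: string;
  specs?: ReportSpec[];
  suites?: ReportSuite[];
}

interface ParsedReport {
  suites: ReportSuite[];
}

// Walks nested suites depth-first and collects titles of specs that are not ok.
function failedSpecTitles(report: ParsedReport): string[] {
  const failed: string[] = [];
  const visit = (suite: ReportSuite): void => {
    (suite.specs || []).forEach((spec) => {
      if (!spec.ok) failed.push(spec.title);
    });
    (suite.suites || []).forEach(visit);
  };
  report.suites.forEach(visit);
  return failed;
}
```

An empty result means the Healer is skipped and the Expander runs next.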

━━━ HEALER AGENT (runs only if failures exist) ━━━

For each failed test:

Read the full error message and attached screenshot
Navigate to the failing page using Playwright MCP to inspect current state
Make a determination:

BROKEN TEST (the test is wrong):
-> The page structure changed, the selector no longer exists, or the expected text changed (not a regression, just drift)
-> Fix: update the selector or assertion to match current reality
-> Re-run the specific test
-> If fixed: continue

CONFIRMED BUG (the application is wrong):
-> The feature is not behaving as the PRD acceptance criteria describe
-> Do NOT fix the test to hide the bug
-> Create: qa/bugs/$FEATURE-[timestamp].md with:
   - Which test failed
   - What the expected behaviour is (from PRD)
   - What the actual behaviour is
   - Screenshot path
   - Steps to reproduce
-> STOP the QA loop
-> Report: "Bug confirmed in $FEATURE. QA loop stopped. Run /build $FEATURE with this bug report to fix."

Maximum 3 fix attempts per test before treating as confirmed bug.
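The Healer's decision rule, including the three-attempt cap, can be written as a small pure function. This is a sketch; the type and function names are hypothetical, and the timestamp format in the bug report path is left to the caller because the skill does not specify one.

```typescript
type HealerVerdict = "broken-test" | "confirmed-bug";
type HealerAction = "fix-and-rerun" | "file-bug-and-stop";

const MAX_FIX_ATTEMPTS = 3;

// attemptsSoFar counts fix attempts already made for this one test.
function nextHealerAction(
  verdict: HealerVerdict,
  attemptsSoFar: number
): HealerAction {
  if (verdict === "confirmed-bug") return "file-bug-and-stop";
  // A test that still fails after three fixes is treated as a confirmed bug.
  return attemptsSoFar < MAX_FIX_ATTEMPTS ? "fix-and-rerun" : "file-bug-and-stop";
}

// Builds the bug report path described above.
function bugReportPath(feature: string, timestamp: string): string {
  return `qa/bugs/${feature}-${timestamp}.md`;
}
```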

━━━ EXPANDER AGENT (runs only if all tests pass) ━━━

Review qa/plans/$FEATURE-prioritized.md and tests/e2e/$FEATURE.spec.ts.

Find gaps — scenarios not yet covered. Look specifically for:

What happens when the user submits an empty form?
What happens at maximum input length (e.g. 10,000 character input)?
What happens if the user navigates away mid-flow and returns?
What happens if the user hits browser back/forward?
What happens on a very slow connection? (use Playwright network throttling)
What happens if the user is not authenticated and tries this feature?
What happens with special characters or emoji in text inputs?

Add 3-5 new tests to tests/e2e/$FEATURE.spec.ts.
Append the new scenarios to qa/plans/$FEATURE-prioritized.md.
Run the full suite again: npx playwright test tests/e2e/$FEATURE.spec.ts
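Several of the gaps listed above are input-driven, so the Expander could feed a shared set of edge-case values into form tests. A minimal sketch follows; the helper name and the specific sample strings are illustrative assumptions, with only the 10,000-character length taken from the list above.

```typescript
interface EdgeCaseInput {
  label: string;
  value: string;
}

// Builds edge-case strings for text inputs.
function edgeCaseInputs(maxLength: number = 10000): EdgeCaseInput[] {
  return [
    { label: "empty", value: "" },
    // Joining maxLength + 1 empty slots yields exactly maxLength characters.
    { label: "maximum length", value: new Array(maxLength + 1).join("a") },
    { label: "special characters", value: "<b>&\"'%;" },
    { label: "emoji", value: "🙂🚀" },
  ];
}
```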

━━━ SNAPSHOT AGENT ━━━

For every page involved in $FEATURE, capture screenshots at three viewports:

Desktop: 1440 x 900
Tablet: 768 x 1024
Mobile: 375 x 812

Save to:
qa/visual-baselines/$FEATURE/[screen]-desktop.png
qa/visual-baselines/$FEATURE/[screen]-tablet.png
qa/visual-baselines/$FEATURE/[screen]-mobile.png

FIRST RUN BEHAVIOUR:
These screenshots ARE the baseline. Save them.
Document in qa/visual-baselines/$FEATURE/README.md:

Date baseline was created
What build/commit this represents
Any known intentional visual quirks

SUBSEQUENT RUN BEHAVIOUR:
Run: npx playwright test --project=visual
Compare each screenshot against baseline.
If pixel difference > 2%: flag as visual regression.
Save diff images to: qa/visual-reports/$FEATURE-[date]-diff.png
A visual regression is treated the same as a test failure.

TO INTENTIONALLY UPDATE BASELINE:
Run: npx playwright test --project=visual --update-snapshots
Commit new baseline files.
Document what changed and why in qa/visual-baselines/$FEATURE/README.md.
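The baseline naming scheme and the 2% threshold can be sketched as two helpers. This assumes the differing-pixel count comes from whatever image comparison the visual project performs; the function names are hypothetical.

```typescript
type ViewportLabel = "desktop" | "tablet" | "mobile";

// Builds the baseline path for one screen at one viewport.
function baselinePath(
  feature: string,
  screenName: string,
  viewport: ViewportLabel
): string {
  return `qa/visual-baselines/${feature}/${screenName}-${viewport}.png`;
}

const MAX_DIFF_RATIO = 0.02; // 2% of all pixels

// True when more than 2% of the pixels differ from the baseline.
function isVisualRegression(
  differingPixels: number,
  width: number,
  height: number
): boolean {
  return differingPixels / (width * height) > MAX_DIFF_RATIO;
}
```

When the comparison is done with Playwright's `toHaveScreenshot`, its `maxDiffPixelRatio` option is one place where the same threshold could be configured.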

━━━ QUALITY GATE ━━━

Calculate score:

P0 tests: All must pass. Any P0 failure = score 0 = STOP HERE.
P0 passing: 40 points
P1 passing: [passing / total] x 30 points
P2 passing: [passing / total] x 15 points
Visual match: All snapshots match baseline = 15 points. Any visual regression = 0 points for this category.

TOTAL POSSIBLE: 100 points

Score < 85: FAIL
-> Write full report to qa/QUALITY_LOG.md
-> Output to user: which tests failed, which snapshots regressed, what the likely causes are
-> "Run /build $FEATURE with this report to address failures."

Score >= 85: PASS
-> Append to qa/QUALITY_LOG.md: date, feature, score, test count
-> "QA passed for $FEATURE. Score: [X]/100. Proceed to CI."
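The scoring rules translate directly into code. This is a minimal sketch; the names are hypothetical. It makes one assumption the rules do not cover: a tier with no tests earns its full points.

```typescript
interface TierResult {
  passing: number;
  total: number;
}

interface GateInput {
  p0: TierResult;
  p1: TierResult;
  p2: TierResult;
  visualRegressions: number;
}

const PASS_THRESHOLD = 85;

function qualityScore(input: GateInput): number {
  // Any P0 failure zeroes the whole score.
  if (input.p0.passing < input.p0.total) return 0;
  // Assumption: an empty tier counts as fully passing.
  const ratio = (t: TierResult): number =>
    t.total === 0 ? 1 : t.passing / t.total;
  const visualPoints = input.visualRegressions === 0 ? 15 : 0;
  return 40 + ratio(input.p1) * 30 + ratio(input.p2) * 15 + visualPoints;
}

function gatePasses(score: number): boolean {
  return score >= PASS_THRESHOLD;
}
```

For example, with all P0 tests passing, 5 of 10 P1 tests passing, all P2 tests passing, and no visual regressions, the score is 40 + 15 + 15 + 15 = 85, which just clears the gate.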

Source

git clone https://github.com/ajaywadhara/agentic-sdlc-plugin
SKILL.md: https://github.com/ajaywadhara/agentic-sdlc-plugin/blob/main/skills/qa-run/SKILL.md

Overview

qa-run orchestrates an eight-agent QA loop for web apps, starting with live browser exploration via Playwright MCP and progressing through analysis, planning, testing, auditing, healing, expanding, and snapshot steps. It enforces a quality gate of 85 or higher to pass and outputs initial test artifacts for refinement.

How This Skill Works

The Browser Agent uses Playwright MCP to navigate every screen, capture the live accessibility tree, and log real selectors. The Analyst reads the feature code and exploration notes, and the Planner writes prioritized Given/When/Then scenarios. The Engineer then writes the tests, the Sentinel audits them, the Healer triages failures, the Expander adds edge cases, and the Snapshot Agent captures visual baselines before the quality gate is scored.

When to Use It

Validating a new or updated web feature across all screens ($FEATURE or all).
Ensuring selectors and accessibility are accurate by testing against the live app.
Generating and refining end-to-end tests from plan artifacts.
Performing health checks and audits to surface errors, warnings, or missing elements.
Expanding coverage to multiple breakpoints and device sizes.

Quick Start

Step 1: Read CLAUDE.md and ensure the dev server is running.
Step 2: Run qa-run with a feature ($FEATURE) or 'all'.
Step 3: Review outputs in qa/browser-tests/$FEATURE/exploration.md and qa/plans/$FEATURE.md, then refine coverage and re-run to meet the 85+ gate.

Best Practices

Explore the live app with Playwright MCP before writing tests; never guess selectors.
Capture and compare accessibility trees, roles, and visible text; base tests on real app state.
Log discrepancies immediately and reference docs/SCREENS.md and wireframes/ to resolve gaps.
Test across 3 device sizes (desktop, tablet, mobile) to ensure responsive reliability.
Review qa/browser-tests and qa/plans outputs to iteratively improve coverage.

Example Use Cases

Rolling out a new dashboard feature: qa-run documents live surfaces, generates initial tests, and the feature clears the 85-point quality gate (for example, with a score of 92).
Auth flow validation: planner creates Given/When/Then scenarios for login, signup, and password reset.
Regression guard: discovery uncovers mismatches between docs and live UI, prompting fixes and updated exploration.md.
Test file generation: initial .spec.ts files are produced for engineer refinement and faster delivery.
Responsive QA: 3-device snapshots help fix layout issues before PR merge.
