ab-test-analysis

npx machina-cli add skill phuryn/pm-skills/ab-test-analysis --openclaw
Files (1)
SKILL.md
3.5 KB

A/B Test Analysis

Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.

Context

You are analyzing A/B test results for $ARGUMENTS.

If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.

Instructions

  1. Understand the experiment:

    • What was the hypothesis?
    • What was changed (the variant)?
    • What is the primary metric? Any guardrail metrics?
    • How long did the test run?
    • What is the traffic split?
  2. Validate the test setup:

    • Sample size: Is the sample large enough for the expected effect size?
      • Use the formula: n = 2 × (Z_{α/2} + Z_β)² × p × (1-p) / MDE² per group
      • Flag if the test is underpowered (<80% power)
    • Duration: Did the test run for at least 1-2 full business cycles?
    • Randomization: Any evidence of sample ratio mismatch (SRM)?
    • Novelty/primacy effects: Was there enough time to wash out initial behavior changes?
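As a sketch, the sample-size and SRM checks above can be done with the Python standard library alone (the helper names and the 50/50 default split are illustrative assumptions, not part of the skill):

```python
import math
from statistics import NormalDist

def required_sample_size(p, mde, alpha=0.05, power=0.80):
    """Per-group sample size for a two-proportion test.
    p: baseline conversion rate; mde: minimum detectable effect (absolute)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # power term
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2)

def srm_p_value(n_control, n_variant, expected_ratio=0.5):
    """Chi-squared goodness-of-fit test (1 df) for sample ratio mismatch."""
    total = n_control + n_variant
    exp_c, exp_v = total * expected_ratio, total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    return math.erfc(math.sqrt(chi2 / 2))  # chi-squared survival function, 1 df

# Example: a 10% baseline and a 1-point absolute MDE need ~14,128 users per arm.
print(required_sample_size(0.10, 0.01))
# A very small SRM p-value (e.g. < 0.001) suggests broken randomization.
print(srm_p_value(50_000, 50_250))
```

A test that would need far more traffic than is realistically available should be flagged as underpowered before any significance claims are made.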
  3. Calculate statistical significance:

    • Conversion rate for control and variant
    • Relative lift: (variant - control) / control × 100
    • p-value: using a two-tailed z-test or chi-squared test
    • Confidence interval: 95% CI for the difference
    • Statistical significance: Is p < 0.05?
    • Practical significance: Is the lift meaningful for the business?

    If the user provides raw data, generate and run a Python script to calculate these.
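A minimal sketch of the step-3 calculations, using only the standard library (the function name and return shape are illustrative assumptions):

```python
import math
from statistics import NormalDist

def ab_test_summary(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-tailed z-test for a difference in proportions, with lift and 95% CI."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    # Pooled standard error for the hypothesis test
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift_pct": diff / p_c * 100,
        "p_value": p_value,
        "ci_95": (diff - z_crit * se, diff + z_crit * se),
        "significant": p_value < alpha,
    }

result = ab_test_summary(conv_c=500, n_c=10_000, conv_v=600, n_v=10_000)
print(result["relative_lift_pct"], result["p_value"], result["significant"])
```

Statistical significance alone is not the finish line: a tiny but significant lift can still fail the practical-significance check in the next bullet.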

  4. Check guardrail metrics:

    • Did any guardrail metrics (revenue, engagement, page load time) degrade?
    • A winning primary metric with degraded guardrails may not be a true win
  5. Interpret results:

    | Outcome | Recommendation |
    |---|---|
    | Significant positive lift, no guardrail issues | Ship it — roll out to 100% |
    | Significant positive lift, guardrail concerns | Investigate — understand trade-offs before shipping |
    | Not significant, positive trend | Extend the test — need more data or larger effect |
    | Not significant, flat | Stop the test — no meaningful difference detected |
    | Significant negative lift | Don't ship — revert to control, analyze why |
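The decision matrix can be expressed as a small helper (a sketch; the function name and the trend flag are assumptions, not part of the skill):

```python
def recommend(significant, lift, guardrails_ok=True, positive_trend=False):
    """Map a test outcome to the decision matrix above."""
    if significant and lift > 0:
        return "Ship" if guardrails_ok else "Investigate"
    if significant and lift < 0:
        return "Don't ship"  # revert to control, analyze why
    return "Extend" if positive_trend else "Stop"

print(recommend(True, 0.06))                        # → Ship
print(recommend(False, 0.01, positive_trend=True))  # → Extend
```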
  6. Provide the analysis summary:

    ## A/B Test Results: [Test Name]
    
    **Hypothesis**: [What we expected]
    **Duration**: [X days] | **Sample**: [N control / M variant]
    
    | Metric | Control | Variant | Lift | p-value | Significant? |
    |---|---|---|---|---|---|
    | [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
    | [Guardrail] | ... | ... | ... | ... | ... |
    
    **Recommendation**: [Ship / Extend / Stop / Investigate]
    **Reasoning**: [Why]
    **Next steps**: [What to do]
    

Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.


Source

git clone https://github.com/phuryn/pm-skills

Overview

This skill analyzes A/B test results with statistical rigor and translates findings into clear product decisions. It validates the test setup, computes lift, p-values, and confidence intervals, and evaluates guardrail metrics to guide ship/extend/stop choices.

How This Skill Works

The skill reads the experiment data (if provided) and computes conversion rates for control and variant, the relative lift, the p-value (two-tailed z-test or chi-squared test), and the 95% CI for the difference. It validates sample size with the standard formula n = 2 × (Z_{α/2} + Z_β)² × p × (1-p) / MDE², flags underpowered tests, and applies the predefined decision matrix (ship/extend/stop/investigate).

When to Use It

  • You need to determine if the primary metric shows a statistically significant lift.
  • You want to validate sample size and power before acting.
  • Guardrail metrics (revenue, engagement, load time) must be checked alongside the main metric.
  • You need to interpret the results and pick ship, extend, stop, or investigate.
  • You have raw data and want a generated Python script to run the calculations automatically.

Quick Start

  1. Gather the experiment context: hypothesis, variant, metrics, duration, and traffic split.
  2. Compute control and variant conversion rates, lift, p-value, and 95% CI; check the sample size against the MDE.
  3. Apply the decision matrix and document the analysis; if raw data is available, generate and run a Python script.

Best Practices

  • Define primary and guardrail metrics up front.
  • Predefine MDE, alpha, power; verify sample size and duration.
  • Report both p-values and confidence intervals; use two-tailed tests.
  • Assess practical significance, not just statistical.
  • Document the final decision with next steps and rationale.

Example Use Cases

  • Signup CTA color variant shows significant lift (6%), with no guardrail degradation — ship to all users.
  • Pricing page variant yields non-significant lift; extend the test to increase power or revisit MDE.
  • Homepage redesign significantly improves engagement but increases page load time; investigate trade-offs before shipping.
  • Recommendation feed yields significant positive lift only in mobile traffic; extend test and explore segmentation.
  • Raw data provided; Python script generated to compute conversions, lift, p-values, and 95% CI.
