ab-test-analysis

npx machina-cli add skill phuryn/pm-skills/ab-test-analysis --openclaw
Files (1)
SKILL.md
3.5 KB

A/B Test Analysis

Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.

Context

You are analyzing A/B test results for $ARGUMENTS.

If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.

Instructions

  1. Understand the experiment:

    • What was the hypothesis?
    • What was changed (the variant)?
    • What is the primary metric? Any guardrail metrics?
    • How long did the test run?
    • What is the traffic split?
  2. Validate the test setup:

    • Sample size: Is the sample large enough for the expected effect size?
      • Use the formula: n = 2 × (Z_{α/2} + Z_β)² × p × (1-p) / MDE² per group
      • Flag if the test is underpowered (<80% power)
    • Duration: Did the test run for at least 1-2 full business cycles?
    • Randomization: Any evidence of sample ratio mismatch (SRM)?
    • Novelty/primacy effects: Was there enough time to wash out initial behavior changes?
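As a sketch, the sample-size and SRM checks above can be done with the Python standard library alone (the helper names and the 50/50 default split are illustrative assumptions, not part of the skill):

```python
import math
from statistics import NormalDist

def required_sample_size(p, mde, alpha=0.05, power=0.80):
    """Per-group sample size for a two-proportion test.
    p: baseline conversion rate; mde: minimum detectable effect (absolute)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # power term
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / mde ** 2)

def srm_p_value(n_control, n_variant, expected_ratio=0.5):
    """Chi-squared goodness-of-fit test (1 df) for sample ratio mismatch."""
    total = n_control + n_variant
    exp_c, exp_v = total * expected_ratio, total * (1 - expected_ratio)
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    return math.erfc(math.sqrt(chi2 / 2))  # chi-squared survival function, 1 df

# Example: a 10% baseline and a 1-point absolute MDE need ~14,128 users per arm.
print(required_sample_size(0.10, 0.01))
# A very small SRM p-value (e.g. < 0.001) suggests broken randomization.
print(srm_p_value(50_000, 50_250))
```

A test that would need far more traffic than is realistically available should be flagged as underpowered before any significance claims are made.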
  3. Calculate statistical significance:

    • Conversion rate for control and variant
    • Relative lift: (variant - control) / control × 100
    • p-value: using a two-tailed z-test or chi-squared test
    • Confidence interval: 95% CI for the difference
    • Statistical significance: Is p < 0.05?
    • Practical significance: Is the lift meaningful for the business?

    If the user provides raw data, generate and run a Python script to calculate these.
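A minimal sketch of the step-3 calculations, using only the standard library (the function name and return shape are illustrative assumptions):

```python
import math
from statistics import NormalDist

def ab_test_summary(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-tailed z-test for a difference in proportions, with lift and 95% CI."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    # Pooled standard error for the hypothesis test
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pool)))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "control_rate": p_c,
        "variant_rate": p_v,
        "relative_lift_pct": diff / p_c * 100,
        "p_value": p_value,
        "ci_95": (diff - z_crit * se, diff + z_crit * se),
        "significant": p_value < alpha,
    }

result = ab_test_summary(conv_c=500, n_c=10_000, conv_v=600, n_v=10_000)
print(result["relative_lift_pct"], result["p_value"], result["significant"])
```

Statistical significance alone is not the finish line: a tiny but significant lift can still fail the practical-significance check in the next bullet.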

  4. Check guardrail metrics:

    • Did any guardrail metrics (revenue, engagement, page load time) degrade?
    • A winning primary metric with degraded guardrails may not be a true win
  5. Interpret results:

    | Outcome | Recommendation |
    |---|---|
    | Significant positive lift, no guardrail issues | Ship it — roll out to 100% |
    | Significant positive lift, guardrail concerns | Investigate — understand trade-offs before shipping |
    | Not significant, positive trend | Extend the test — need more data or larger effect |
    | Not significant, flat | Stop the test — no meaningful difference detected |
    | Significant negative lift | Don't ship — revert to control, analyze why |
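The decision matrix can be expressed as a small helper (a sketch; the function name and the trend flag are assumptions, not part of the skill):

```python
def recommend(significant, lift, guardrails_ok=True, positive_trend=False):
    """Map a test outcome to the decision matrix above."""
    if significant and lift > 0:
        return "Ship" if guardrails_ok else "Investigate"
    if significant and lift < 0:
        return "Don't ship"  # revert to control, analyze why
    return "Extend" if positive_trend else "Stop"

print(recommend(True, 0.06))                        # → Ship
print(recommend(False, 0.01, positive_trend=True))  # → Extend
```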
  6. Provide the analysis summary:

    ## A/B Test Results: [Test Name]
    
    **Hypothesis**: [What we expected]
    **Duration**: [X days] | **Sample**: [N control / M variant]
    
    | Metric | Control | Variant | Lift | p-value | Significant? |
    |---|---|---|---|---|---|
    | [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
    | [Guardrail] | ... | ... | ... | ... | ... |
    
    **Recommendation**: [Ship / Extend / Stop / Investigate]
    **Reasoning**: [Why]
    **Next steps**: [What to do]
    

Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.


Source

git clone https://github.com/phuryn/pm-skills

Overview

This skill analyzes A/B test results with statistical rigor and translates findings into clear product decisions. It validates the test setup, computes lift, p-values, and confidence intervals, and evaluates guardrail metrics to guide ship/extend/stop choices.

How This Skill Works

The skill reads the experiment data (if provided) and computes conversion rates for control and variant, the relative lift, the p-value (two-tailed z-test or chi-squared test), and the 95% CI for the difference. It validates sample size with the standard formula n = 2 × (Z_{α/2} + Z_β)² × p × (1-p) / MDE², flags underpowered tests, and applies the predefined decision matrix (ship/extend/stop/investigate).

When to Use It

  • You need to determine if the primary metric shows a statistically significant lift.
  • You want to validate sample size and power before acting.
  • Guardrail metrics (revenue, engagement, load time) must be checked alongside the main metric.
  • You need to interpret the results and pick ship, extend, stop, or investigate.
  • You have raw data and want a generated Python script to run the calculations automatically.

Quick Start

  1. Gather the experiment context: hypothesis, variant, metrics, duration, and traffic split.
  2. Compute control and variant conversion rates, lift, p-value, and 95% CI; check the sample size against the MDE.
  3. Apply the decision matrix and document the analysis; if raw data is available, generate and run a Python script.

Best Practices

  • Define primary and guardrail metrics up front.
  • Predefine MDE, alpha, power; verify sample size and duration.
  • Report both p-values and confidence intervals; use two-tailed tests.
  • Assess practical significance, not just statistical.
  • Document the final decision with next steps and rationale.

Example Use Cases

  • Signup CTA color variant shows significant lift (6%), with no guardrail degradation — ship to all users.
  • Pricing page variant yields non-significant lift; extend the test to increase power or revisit MDE.
  • Homepage redesign significantly improves engagement but increases page load time; investigate trade-offs before shipping.
  • Recommendation feed yields significant positive lift only in mobile traffic; extend test and explore segmentation.
  • Raw data provided; Python script generated to compute conversions, lift, p-values, and 95% CI.
