p-value: the probability of seeing a result at least as extreme as the observed one if there were no real effect.

MDE (Minimum Detectable Effect): the smallest lift worth detecting.

Which command should I use?

Use significance to judge results, sample-size to plan, and duration to estimate run time.

ab-test-stats

npx machina-cli add skill guia-matthieu/clawfu-skills/ab-test-stats --openclaw

A/B Test Statistics Calculator

Calculate statistical significance for A/B tests - know when your results are real, not random chance.

When to Use This Skill

Test analysis - Determine if results are statistically significant
Sample planning - Calculate required sample size before testing
Duration estimation - Know how long to run experiments
Power analysis - Ensure tests can detect meaningful differences

What Claude Does vs What You Decide

Claude Does	You Decide
Structures analysis frameworks	Metric definitions
Identifies patterns in data	Business interpretation
Creates visualization templates	Dashboard design
Suggests optimization areas	Action priorities
Calculates statistical measures	Decision thresholds

Dependencies

pip install scipy numpy click

Commands

Check Significance

python scripts/main.py significance --control 1000,50 --variant 1000,65
python scripts/main.py significance --control 5000,250 --variant 5000,300 --confidence 0.99

Calculate Sample Size

python scripts/main.py sample-size --baseline 0.05 --mde 0.02
python scripts/main.py sample-size --baseline 0.10 --mde 0.01 --power 0.90

Estimate Duration

python scripts/main.py duration --traffic 1000 --baseline 0.05 --mde 0.02
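The duration estimate comes down to dividing the total required sample by daily traffic. A minimal sketch of that arithmetic follows, assuming every daily visitor enters the experiment; `estimate_days` is a hypothetical helper, not a function from `scripts/main.py`, and the inputs are the totals from the sample-size example on this page.

```python
import math


def estimate_days(total_required: int, daily_visitors: int) -> int:
    """Whole days needed to collect total_required visitors (hypothetical helper)."""
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    # Round up: a partial day still has to be run in full.
    return math.ceil(total_required / daily_visitors)


# 7,684 total visitors at 1,000 visitors per day
days = estimate_days(7684, 1000)
print(f"Estimated duration: ~{days} days")
```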

Examples

Example 1: Analyze Test Results

# Control: 1000 visitors, 50 conversions (5%)
# Variant: 1000 visitors, 65 conversions (6.5%)
python scripts/main.py significance --control 1000,50 --variant 1000,65

# Output:
# A/B Test Results
# ─────────────────────────
# Control:  5.00% (50/1000)
# Variant:  6.50% (65/1000)
# Lift:     +30.0%
#
# Statistical Analysis
# ─────────────────────────
# p-value:      0.089
# Confidence:   91.1%
# Result:       NOT SIGNIFICANT (need 95%)
#
# Recommendation: Continue test for more data
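The page does not say which test `scripts/main.py` runs. As a cross-check, here is a sketch of a standard two-sided, pooled two-proportion z-test. It uses the standard library's `statistics.NormalDist` so it runs without extra packages (`scipy.stats.norm` offers equivalent functions); the function name and return shape are assumptions for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(control, variant):
    """Two-sided pooled z-test. Each argument is (visitors, conversions)."""
    n_c, x_c = control
    n_v, x_v = variant
    p_c, p_v = x_c / n_c, x_v / n_v
    # Pooled rate under the null hypothesis of no difference
    pooled = (x_c + x_v) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"z": z, "p_value": p_value, "lift": (p_v - p_c) / p_c}


result = two_proportion_z_test((1000, 50), (1000, 65))
print(f"lift={result['lift']:+.1%}  z={result['z']:.2f}  p={result['p_value']:.3f}")
```

For the 50/1000 versus 65/1000 data this test gives a p-value of about 0.15, not the 0.089 shown in the illustrative output above. Treat that output as a format example and confirm the script's method before relying on its numbers.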

Example 2: Plan Sample Size

# Baseline 5% conversion, want to detect a 20% relative lift (1 percentage point absolute)
python scripts/main.py sample-size --baseline 0.05 --mde 0.01

# Output:
# Sample Size Calculator
# ──────────────────────────────
# Baseline conversion: 5.0%
# Minimum detectable effect: 1.0% (20% relative)
# Target conversion: 6.0%
#
# Required per variant: 3,842 visitors
# Total required: 7,684 visitors
#
# At 1000 daily visitors: ~8 days
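For comparison, a sketch of the common textbook sample-size formula for comparing two proportions (two-sided alpha of 0.05, power of 0.80, unpooled variances). Whether `scripts/main.py` uses this formula is not stated on this page; `sample_size_per_variant` and its defaults are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, mde, alpha=0.05, power=0.80):
    """Visitors per variant to detect an absolute lift of `mde` (hypothetical helper)."""
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n = sample_size_per_variant(baseline=0.05, mde=0.01)
print(f"Per variant: {n:,}  Total: {2 * n:,}")
```

Under these assumptions the formula asks for roughly 8,150 visitors per variant, more than twice the illustrative figure above, so check which formula and defaults the script applies before planning a test around either number.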

Key Concepts

Term	Definition
p-value	Probability of a result at least this extreme if there were no real effect
Confidence	1 - p-value, as this tool reports it (usually want 95%+)
Power	Probability of detecting a real effect (usually 80%)
MDE	Minimum Detectable Effect - smallest lift worth detecting
Lift	Relative improvement: (variant - control) / control
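Lift is relative, while the `--mde` values in the examples on this page are absolute differences in conversion rate. The sketch below shows both calculations; the helper names are hypothetical.

```python
def lift(control_rate: float, variant_rate: float) -> float:
    """Relative improvement of variant over control."""
    return (variant_rate - control_rate) / control_rate


def relative_to_absolute_mde(baseline: float, relative_mde: float) -> float:
    """Convert a relative MDE (e.g. 0.20 for 20%) to an absolute rate difference."""
    return baseline * relative_mde


# 5.0% control vs 6.5% variant is a 30% relative lift
print(f"Lift: {lift(0.05, 0.065):+.1%}")
# A 20% relative MDE on a 5% baseline is 1 percentage point
print(f"Absolute MDE: {relative_to_absolute_mde(0.05, 0.20):.3f}")
```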

When Results Are Significant

p-value	Confidence	Verdict
< 0.01	> 99%	Highly Significant ✓
< 0.05	> 95%	Significant ✓
< 0.10	> 90%	Marginally Significant
≥ 0.10	< 90%	Not Significant ✗
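The thresholds in this table can be expressed as a small lookup. This is a sketch of the table only, with a hypothetical `verdict` function; it is not taken from the script.

```python
def verdict(p_value: float) -> str:
    """Map a p-value to the verdict labels in the table above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < 0.01:
        return "Highly Significant"
    if p_value < 0.05:
        return "Significant"
    if p_value < 0.10:
        return "Marginally Significant"
    return "Not Significant"


for p in (0.004, 0.03, 0.089, 0.15):
    print(f"p={p:<6} {verdict(p)}")
```

A marginal result such as p = 0.089 still fails a 95% confidence requirement, which is why the earlier example reports it as not significant.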

Skill Boundaries

What This Skill Does Well

Structuring data analysis
Identifying patterns and trends
Creating visualization frameworks
Calculating statistical measures

What This Skill Cannot Do

Access your actual data
Replace statistical expertise
Make business decisions
Guarantee prediction accuracy

Related Skills

cohort-analysis - Analyze user cohorts
funnel-analyzer - Analyze conversion funnels

Skill Metadata

Mode: centaur

category: analytics
subcategory: statistics
dependencies: [scipy, numpy, click]
difficulty: intermediate
time_saved: 3+ hours/week

Source

git clone https://github.com/guia-matthieu/clawfu-skills
SKILL.md: https://github.com/guia-matthieu/clawfu-skills/blob/main/skills/analytics/ab-test-stats/SKILL.md

Overview

Determines if A/B test results are statistically significant and guides planning. It helps you calculate required sample sizes, estimate experiment duration, and perform power analyses to detect meaningful differences.

How This Skill Works

The tool computes p-values, confidence levels, and lift from control and variant data to assess significance. It provides three commands (significance, sample-size, and duration) and uses scipy and numpy for the statistical calculations.

When to Use It

Determine if test results are statistically significant
Calculate required sample size before running a test
Estimate how long an experiment should run
Perform power analysis to detect meaningful differences
Analyze conversion experiments to guide decisions

Quick Start

Step 1: Install dependencies (pip install scipy numpy click)
Step 2: Run significance with control/variant data, e.g., python scripts/main.py significance --control 1000,50 --variant 1000,65
Step 3: Interpret the output (p-value, confidence, result) and decide to continue or stop

Best Practices

Define baseline and minimum detectable effect (MDE) before testing
Use adequate sample sizes to avoid false negatives or positives
Interpret p-values and confidence levels in the context of your business goals
Run power analysis to ensure the test can detect the desired lift
Plan duration with realistic traffic estimates to prevent premature conclusions
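The power check recommended above can be approximated before launch. The sketch below uses the normal approximation for a two-sided two-proportion test and ignores the negligible opposite tail; `achieved_power` is a hypothetical helper, and the skill's own power calculation may differ.

```python
import math
from statistics import NormalDist


def achieved_power(baseline, mde, n_per_variant, alpha=0.05):
    """Approximate power to detect an absolute lift of `mde` with n visitors per variant."""
    p1, p2 = baseline, baseline + mde
    # Standard error of the difference in rates (unpooled)
    se = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# Power grows with sample size for a 5% baseline and a 1-point MDE
for n in (2000, 5000, 8155):
    print(f"n={n:>5}  power={achieved_power(0.05, 0.01, n):.2f}")
```

If the computed power falls well below the usual 80% target, the test is likely to miss a real lift of that size, and the sample size or MDE should be revisited.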

Example Use Cases

Example 1: Analyze Test Results – compare control vs. variant conversions and assess significance
Example 2: Plan Sample Size – baseline 5% with 1% absolute MDE to compute required visitors
Example 3: Estimate Duration – with 1000 daily visitors, baseline 5% and MDE 2%
Example 4: Output Interpretation – determine if results are not significant and require more data
Example 5: Power Check – ensure the test design can detect the desired lift before starting
