What is pre-registration in this skill?

Document the hypothesis and success criteria before seeing results; avoid peeking at data.

When is Bayesian analysis used?

As a complement for ambiguous results, using Bayes Factors to assess evidence strength.

What happens after validation?

Results are classified as VALIDATED, REJECTED, or INSUFFICIENT with a complete documentation audit trail.

scientific-validation

npx machina-cli add skill akaszubski/autonomous-dev/scientific-validation --openclaw

Scientific Validation Skill

Rigorous methodology for validating claims from any source - books, papers, theories, or intuition.

When This Skill Activates

Testing claims from books, papers, or expert sources
Validating rules, strategies, or hypotheses
Running experiments or backtests
Keywords: "validate", "test hypothesis", "experiment", "backtest", "prove", "evidence"

Core Principle

Data is the arbiter. Sources can be wrong.

Expert books can be wrong
Only empirical validation decides what works
Document negative results - they're valuable

Phase Overview

Phase	Name	Key Requirement
0	Claim Verification	Understand what source ACTUALLY claims
1	Claims Extraction	Document with source citations
1.5	Publication Bias Prevention	Document ALL claims before selecting
2	Pre-Registration	Hypothesis BEFORE seeing results
2.3	Power Analysis	Calculate required n (MANDATORY)
3	Bias Prevention	Look-ahead, survivorship, selection
3.5	Walk-Forward	Required for time series (MANDATORY)
4	Statistical Requirements	p-values, effect sizes, corrections
4.7	Bayesian Complement	Bayes Factors for ambiguous results
5	Multi-Source Validation	Test across 3+ contexts
5.3	Sensitivity Analysis	±20% parameter stability (MANDATORY)
5.5	Adversarial Review	Invoke experiment-critic agent
6	Classification	VALIDATED / REJECTED / INSUFFICIENT
7	Documentation	Complete audit trail
7.3	Negative Results	Structured failure documentation

See: workflow.md for detailed step-by-step instructions per phase.

Quick Reference

Claim Types

Type	Testable?	Example
PERFORMANCE	YES	"A beats B on metric X"
METHODOLOGICAL	YES	"A enables capability X"
PHILOSOPHICAL	MAYBE	"X is important because Y"
BEHAVIORAL	HARD	"Humans do X in situation Y"

Sample Size Requirements (80% Power, α = 0.05, Two-Sided)

Effect Size	Cohen's d	Required n (per group)
Small	0.2	394
Medium	0.5	64
Large	0.8	26

See: code-examples.md#power-analysis for calculation code.
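As a rough cross-check of the table above, the required n can be approximated with the standard library alone. This is a minimal sketch, not the skill's own code: it assumes a two-sided, two-sample t-test with equal group sizes, and it uses the normal approximation with Guenther's small-sample correction rather than an exact noncentral-t solve. The function name `required_n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-group n for a two-sided, two-sample t-test (assumed design)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    n_normal = 2 * ((z_alpha + z_power) / d) ** 2  # plain normal approximation
    # Guenther's correction compensates for using z instead of t quantiles.
    return ceil(n_normal + z_alpha ** 2 / 4)


for label, d in [("Small", 0.2), ("Medium", 0.5), ("Large", 0.8)]:
    print(f"{label:<6} d={d}: n per group = {required_n_per_group(d)}")
```

Under these assumptions the sketch reproduces the table's values of 394, 64, and 26 per group.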

Classification Criteria

Status	Criteria
VALIDATED	Out-of-sample (OOS) results meet all criteria + critic PROCEED
CONDITIONAL	OOS meets relaxed criteria (p < 0.10)
REJECTED	OOS fails OR negative effect
INSUFFICIENT	n < 15 in OOS
UNTESTABLE	Required data unavailable
INVALID	Circular validation detected
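The criteria above could be encoded as a small decision function. This is only a sketch: the table does not state an order of precedence, so the ordering below (UNTESTABLE, then INVALID, then INSUFFICIENT, and so on) is an assumption, as are the `ExperimentResult` field names and the idea that the critic returns the string "PROCEED".

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    # All field names are hypothetical, chosen for illustration only.
    data_available: bool
    circular_validation: bool
    n_oos: int                 # out-of-sample sample size
    effect: float              # signed out-of-sample effect
    p_value: float
    meets_all_criteria: bool   # pre-registered success criteria met out-of-sample
    critic_verdict: str        # assumed to be "PROCEED" when the critic approves


def classify(r: ExperimentResult) -> str:
    """Map a result to a status; the check order is an assumption."""
    if not r.data_available:
        return "UNTESTABLE"
    if r.circular_validation:
        return "INVALID"
    if r.n_oos < 15:
        return "INSUFFICIENT"
    if r.effect < 0:
        return "REJECTED"
    if r.meets_all_criteria and r.critic_verdict == "PROCEED":
        return "VALIDATED"
    if r.p_value < 0.10:
        return "CONDITIONAL"
    return "REJECTED"
```

Checking the structural problems first keeps a circular or data-starved experiment from ever being scored on its p-value.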

Domain Effect Thresholds (Trading)

Metric	Minimum	Strong	Exceptional
Sharpe Ratio	> 0.5	> 1.0	> 2.0
Win Rate	> 55%	> 60%	> 70%
Profit Factor	> 1.2	> 1.5	> 2.0

See: code-examples.md#effect-thresholds for other domains.
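One way to apply the trading thresholds is a lookup that grades a metric value into a tier. The sketch below assumes win rate is expressed as a fraction (0.55 for 55%) and that each threshold is a strict "greater than", as the table's `>` signs suggest. The names `THRESHOLDS` and `grade` are hypothetical.

```python
# (minimum, strong, exceptional) per metric, copied from the table above.
THRESHOLDS = {
    "sharpe_ratio": (0.5, 1.0, 2.0),
    "win_rate": (0.55, 0.60, 0.70),   # assumed to be fractions, not percentages
    "profit_factor": (1.2, 1.5, 2.0),
}


def grade(metric: str, value: float) -> str:
    """Return the highest tier whose threshold the value strictly exceeds."""
    minimum, strong, exceptional = THRESHOLDS[metric]
    if value > exceptional:
        return "Exceptional"
    if value > strong:
        return "Strong"
    if value > minimum:
        return "Minimum"
    return "Below minimum"


print(grade("sharpe_ratio", 1.3))
```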

Bayes Factor Interpretation

BF (alternative vs. null)	Evidence
< 1	Supports null
1-3	Anecdotal
3-10	Moderate
10-30	Strong
> 30	Very strong
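The interpretation table can be expressed as a simple lookup. The table leaves the boundary values ambiguous (3 and 10 each appear in two rows), so this sketch assumes lower bounds are inclusive and that exactly 30 still counts as "Strong". The function name is hypothetical, and the Bayes Factor is assumed to be for the alternative over the null.

```python
def interpret_bayes_factor(bf: float) -> str:
    """Label a Bayes Factor (alternative over null) using the table's bands."""
    if bf <= 0:
        raise ValueError("A Bayes Factor must be positive")
    if bf < 1:
        return "Supports null"
    if bf < 3:
        return "Anecdotal"
    if bf < 10:
        return "Moderate"
    if bf <= 30:  # assumption: the boundary value 30 belongs to "Strong"
        return "Strong"
    return "Very strong"


print(interpret_bayes_factor(4.2))
```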

Critical Rules

1. Pre-Registration

Document hypothesis BEFORE seeing any results
Define success criteria BEFORE testing
No peeking at test data

2. Power Analysis (Phase 2.3)

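from math import ceil
from statsmodels.stats.power import TTestIndPower
# Per-group n for a two-sided, two-sample t-test; round up to a whole observation
n = ceil(TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05))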

Rule: Underpowered studies cannot achieve VALIDATED status.

3. Walk-Forward for Time Series (Phase 3.5)

Standard K-fold CV → INVALID (temporal leakage)
Single train/test → CONDITIONAL at best
Walk-forward → Can achieve VALIDATED

See: code-examples.md#walk-forward for implementation.
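For orientation, a walk-forward split can be sketched as a generator of train/test index ranges in which every test window lies strictly after its training window. This is an illustrative sketch, not the implementation in code-examples.md: it assumes a fixed-size rolling training window (an expanding window is an equally valid choice), and the function and parameter names are hypothetical.

```python
from typing import Iterator, Optional, Tuple


def walk_forward_splits(
    n_samples: int, train_size: int, test_size: int, step: Optional[int] = None
) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges; each test range follows its train range."""
    step = step or test_size  # by default, test windows do not overlap
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


for train, test in walk_forward_splits(n_samples=10, train_size=4, test_size=2):
    print(list(train), "->", list(test))
```

Because each test range starts where its training range ends, no future observation can leak into training, which is the property standard K-fold lacks for time series.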

4. Multiple Comparison Correction

alpha_corrected = 0.05 / num_claims  # Bonferroni

For trading claims: require a t-statistic above 3.0 (the threshold proposed by Harvey, Liu, and Zhu).
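The correction and the stricter trading threshold can be combined into one significance check. This is a minimal sketch with hypothetical function names. It assumes the t-statistic rule applies in addition to, not instead of, the Bonferroni-corrected p-value.

```python
def bonferroni_alpha(num_claims: int, alpha: float = 0.05) -> float:
    """Per-claim significance level after Bonferroni correction."""
    if num_claims < 1:
        raise ValueError("num_claims must be at least 1")
    return alpha / num_claims


def is_significant(p_value: float, t_stat: float, num_claims: int,
                   trading: bool = False) -> bool:
    """Check the corrected p-value and, for trading claims, |t| > 3.0 as well."""
    passed = p_value < bonferroni_alpha(num_claims)
    if trading:
        passed = passed and abs(t_stat) > 3.0
    return passed


# Ten claims tested: each must clear p < 0.005.
print(is_significant(p_value=0.004, t_stat=2.5, num_claims=10))
print(is_significant(p_value=0.004, t_stat=2.5, num_claims=10, trading=True))
```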

5. Sensitivity Analysis (Phase 5.3)

Test ±20% parameter variation:

All variations positive → Can achieve VALIDATED
1-2 sign flips → CONDITIONAL at best
3+ sign flips → REJECTED (fragile)

See: code-examples.md#sensitivity-analysis for implementation.
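The sign-flip rule above might be sketched as follows. Several details are assumptions rather than part of the skill: each parameter is varied one at a time to 80% and 120% of its base value, a "sign flip" means the varied effect is zero or negative, and `evaluate` is a hypothetical callback that returns the effect for a given parameter set.

```python
from typing import Callable, Dict, Tuple


def sensitivity_status(
    evaluate: Callable[[Dict[str, float]], float],
    base_params: Dict[str, float],
    variation: float = 0.20,
) -> Tuple[int, str]:
    """Count non-positive effects under one-at-a-time parameter variation."""
    flips = 0
    for name, value in base_params.items():
        for factor in (1 - variation, 1 + variation):
            varied = dict(base_params)
            varied[name] = value * factor
            if evaluate(varied) <= 0:  # assumption: the base effect is positive
                flips += 1
    if flips == 0:
        return flips, "Can achieve VALIDATED"
    if flips <= 2:
        return flips, "CONDITIONAL at best"
    return flips, "REJECTED (fragile)"


# Toy effect that shrinks as the (hypothetical) lookback parameter grows.
toy_effect = lambda p: 1.0 - p["lookback"] / 25.0
print(sensitivity_status(toy_effect, {"lookback": 20.0}))
```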

6. Adversarial Review (Phase 5.5)

Use Task tool:
  subagent_type: "experiment-critic"
  prompt: "Review experiment EXP-XXX"

MANDATORY before any classification.

Bias Prevention Checklist

Bias	Prevention
Look-ahead	Process data sequentially, compare batch vs streaming
Survivorship	Track ALL attempts, not just completions
Selection	Report ALL experiments including failures
Data snooping	Strict train/test split, no tuning on test data
Publication	Document ALL claims before selecting which to test
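The "compare batch vs streaming" check for look-ahead bias can be illustrated with a small test harness. The idea, sketched here under assumptions, is that a feature value at time t should be the same whether it is computed on the full series (batch) or on only the data available up to t (streaming). All function names are hypothetical, and the two demeaning features are toy examples.

```python
from statistics import fmean
from typing import Callable, List, Sequence


def has_look_ahead(
    feature_fn: Callable[[Sequence[float]], List[float]],
    data: Sequence[float],
    tol: float = 1e-9,
) -> bool:
    """True if any batch value differs from the value computed on the prefix alone."""
    batch = feature_fn(data)
    for t in range(len(data)):
        streamed = feature_fn(data[: t + 1])[-1]  # only data up to time t
        if abs(streamed - batch[t]) > tol:
            return True
    return False


def demean_full_sample(xs: Sequence[float]) -> List[float]:
    """Leaky toy feature: subtracts the mean of the whole series, including the future."""
    mu = fmean(xs)
    return [x - mu for x in xs]


def demean_expanding(xs: Sequence[float]) -> List[float]:
    """Safe toy feature: subtracts the mean of observations seen so far."""
    return [x - fmean(xs[: i + 1]) for i, x in enumerate(xs)]


prices = [1.0, 2.0, 4.0, 8.0]
print(has_look_ahead(demean_full_sample, prices))
print(has_look_ahead(demean_expanding, prices))
```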

Pre-Experiment Checklist

Claim extracted with source citation
ALL claims documented (not just tested ones)
Hypothesis documented BEFORE results
Power analysis: required n calculated
Success criteria defined
Walk-forward configured (time series)
Costs/constraints specified

Post-Experiment Checklist

Sample size adequate per power analysis
p-value AND effect size reported
Bayesian analysis if ambiguous
Sensitivity analysis passed
Adversarial review completed
Negative results documented if REJECTED

Red Flags

100% success rate → Possible bias
OOS better than training → Possible leakage
Result flips with ±20% params → Fragile
Only tested "interesting" claims → Selection bias

Key Principles

Hypothesis BEFORE data - No peeking
Power analysis BEFORE experiment - Know required n
Walk-forward for time series - Preserve temporal order
Sensitivity analysis - Results must survive ±20% changes
Adversarial self-critique - Challenge your methodology
Document negative results - Failures are valuable
Sources can be wrong - Even experts, even textbooks

Detailed Documentation

Topic	File
Step-by-step workflow	`workflow.md`
Python code examples	`code-examples.md`
Markdown templates	`templates.md`
Adversarial review	`../../agents/experiment-critic.md`

Hard Rules

FORBIDDEN:

Reporting results without confidence intervals or statistical significance
Cherry-picking favorable metrics while ignoring unfavorable ones
Claiming causation from correlation without controlled experiments

REQUIRED:

All experiments MUST have a documented hypothesis before execution
All results MUST include sample size, variance, and statistical test used
Negative results MUST be reported with the same rigor as positive results
Baselines MUST be established and compared against for every metric

Source

git clone https://github.com/akaszubski/autonomous-dev

The skill file is at plugins/autonomous-dev/skills/scientific-validation/SKILL.md on the master branch.

Overview

The Scientific Validation Skill provides a rigorous framework for validating claims from books, papers, or theories. It enforces pre-registration, power analysis, statistical rigor, Bayesian methods, sensitivity analysis, bias prevention, and evidence-based acceptance or rejection.

How This Skill Works

Work through the phases in order, from claim verification to documentation. Pre-register hypotheses before seeing results, run a power analysis to determine the required sample size, and apply bias controls. Use Bayesian methods when results are ambiguous, validate findings across multiple sources, and keep a complete audit trail.

When to Use It

Validating claims from books, papers, or expert sources
Testing rules, strategies, or hypotheses with empirical evidence
Conducting experiments or backtests with pre-registered hypotheses
Assessing time-series or sequential claims that require walk-forward analysis
Reaching evidence-based decisions via multi-source validation and sensitivity checks

Quick Start

Step 1: Extract the exact claims from the source and cite them clearly
Step 2: Pre-register hypotheses and perform a power analysis to determine required sample size
Step 3: Run experiments/backtests, apply sensitivity analysis and Bayesian checks, then classify results

Best Practices

Document source claims with citations and exact wording
Pre-register hypotheses and success criteria before accessing results
Run power analyses to determine adequate sample size (n) before testing
Incorporate bias prevention: look-ahead, survivorship, and selection effects
Maintain a complete audit trail and report negative results openly

Example Use Cases

Backtest a trading rule with pre-registered hypotheses and a walk-forward analysis
Evaluate a published bias-correction method across multiple datasets
Replicate a claimed performance improvement in 3+ independent contexts
Use Bayes Factors to resolve ambiguous results in a validation study
Document negative results to curb publication bias and strengthen conclusions
