scientific-validation
npx machina-cli add skill akaszubski/autonomous-dev/scientific-validation --openclawScientific Validation Skill
Rigorous methodology for validating claims from any source - books, papers, theories, or intuition.
When This Skill Activates
- Testing claims from books, papers, or expert sources
- Validating rules, strategies, or hypotheses
- Running experiments or backtests
- Keywords: "validate", "test hypothesis", "experiment", "backtest", "prove", "evidence"
Core Principle
Data is the arbiter. Sources can be wrong.
- Expert books can be wrong
- Only empirical validation decides what works
- Document negative results - they're valuable
Phase Overview
| Phase | Name | Key Requirement |
|---|---|---|
| 0 | Claim Verification | Understand what source ACTUALLY claims |
| 1 | Claims Extraction | Document with source citations |
| 1.5 | Publication Bias Prevention | Document ALL claims before selecting |
| 2 | Pre-Registration | Hypothesis BEFORE seeing results |
| 2.3 | Power Analysis | Calculate required n (MANDATORY) |
| 3 | Bias Prevention | Look-ahead, survivorship, selection |
| 3.5 | Walk-Forward | Required for time series (MANDATORY) |
| 4 | Statistical Requirements | p-values, effect sizes, corrections |
| 4.7 | Bayesian Complement | Bayes Factors for ambiguous results |
| 5 | Multi-Source Validation | Test across 3+ contexts |
| 5.3 | Sensitivity Analysis | ±20% parameter stability (MANDATORY) |
| 5.5 | Adversarial Review | Invoke experiment-critic agent |
| 6 | Classification | VALIDATED / REJECTED / INSUFFICIENT |
| 7 | Documentation | Complete audit trail |
| 7.3 | Negative Results | Structured failure documentation |
See: workflow.md for detailed step-by-step instructions per phase.
Quick Reference
Claim Types
| Type | Testable? | Example |
|---|---|---|
| PERFORMANCE | YES | "A beats B on metric X" |
| METHODOLOGICAL | YES | "A enables capability X" |
| PHILOSOPHICAL | MAYBE | "X is important because Y" |
| BEHAVIORAL | HARD | "Humans do X in situation Y" |
Sample Size Requirements (80% Power)
| Effect Size | Cohen's d | Required n |
|---|---|---|
| Small | 0.2 | 394 |
| Medium | 0.5 | 64 |
| Large | 0.8 | 26 |
See: code-examples.md#power-analysis for calculation code.
Classification Criteria
| Status | Criteria |
|---|---|
| VALIDATED | OOS meets all criteria + critic PROCEED |
| CONDITIONAL | OOS meets relaxed criteria (p < 0.10) |
| REJECTED | OOS fails OR negative effect |
| INSUFFICIENT | n < 15 in OOS |
| UNTESTABLE | Required data unavailable |
| INVALID | Circular validation detected |
Domain Effect Thresholds (Trading)
| Metric | Minimum | Strong | Exceptional |
|---|---|---|---|
| Sharpe Ratio | > 0.5 | > 1.0 | > 2.0 |
| Win Rate | > 55% | > 60% | > 70% |
| Profit Factor | > 1.2 | > 1.5 | > 2.0 |
See: code-examples.md#effect-thresholds for other domains.
Bayes Factor Interpretation
| BF | Evidence |
|---|---|
| < 1 | Supports null |
| 1-3 | Anecdotal |
| 3-10 | Moderate |
| 10-30 | Strong |
| > 30 | Very strong |
Critical Rules
1. Pre-Registration
- Document hypothesis BEFORE seeing any results
- Define success criteria BEFORE testing
- No peeking at test data
2. Power Analysis (Phase 2.3)
from statsmodels.stats.power import TTestIndPower
n = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
Rule: Underpowered studies cannot achieve VALIDATED status.
3. Walk-Forward for Time Series (Phase 3.5)
- Standard K-fold CV → INVALID (temporal leakage)
- Single train/test → CONDITIONAL at best
- Walk-forward → Can achieve VALIDATED
See: code-examples.md#walk-forward for implementation.
4. Multiple Comparison Correction
alpha_corrected = 0.05 / num_claims # Bonferroni
For trading claims: require t-ratio > 3.0 (Harvey et al. standard).
5. Sensitivity Analysis (Phase 5.3)
Test ±20% parameter variation:
- All variations positive → Can achieve VALIDATED
- 1-2 sign flips → CONDITIONAL at best
- 3+ sign flips → REJECTED (fragile)
See: code-examples.md#sensitivity-analysis for implementation.
6. Adversarial Review (Phase 5.5)
Use Task tool:
subagent_type: "experiment-critic"
prompt: "Review experiment EXP-XXX"
MANDATORY before any classification.
Bias Prevention Checklist
| Bias | Prevention |
|---|---|
| Look-ahead | Process data sequentially, compare batch vs streaming |
| Survivorship | Track ALL attempts, not just completions |
| Selection | Report ALL experiments including failures |
| Data snooping | Strict train/test split, no tuning on test data |
| Publication | Document ALL claims before selecting which to test |
Pre-Experiment Checklist
- Claim extracted with source citation
- ALL claims documented (not just tested ones)
- Hypothesis documented BEFORE results
- Power analysis: required n calculated
- Success criteria defined
- Walk-forward configured (time series)
- Costs/constraints specified
Post-Experiment Checklist
- Sample size adequate per power analysis
- p-value AND effect size reported
- Bayesian analysis if ambiguous
- Sensitivity analysis passed
- Adversarial review completed
- Negative results documented if REJECTED
Red Flags
- 100% success rate → Possible bias
- OOS better than training → Possible leakage
- Result flips with ±20% params → Fragile
- Only tested "interesting" claims → Selection bias
Key Principles
- Hypothesis BEFORE data - No peeking
- Power analysis BEFORE experiment - Know required n
- Walk-forward for time series - Preserve temporal order
- Sensitivity analysis - Results must survive ±20% changes
- Adversarial self-critique - Challenge your methodology
- Document negative results - Failures are valuable
- Sources can be wrong - Even experts, even textbooks
Detailed Documentation
| Topic | File |
|---|---|
| Step-by-step workflow | workflow.md |
| Python code examples | code-examples.md |
| Markdown templates | templates.md |
| Adversarial review | ../../agents/experiment-critic.md |
Hard Rules
FORBIDDEN:
- Reporting results without confidence intervals or statistical significance
- Cherry-picking favorable metrics while ignoring unfavorable ones
- Claiming causation from correlation without controlled experiments
REQUIRED:
- All experiments MUST have a documented hypothesis before execution
- All results MUST include sample size, variance, and statistical test used
- Negative results MUST be reported with the same rigor as positive results
- Baselines MUST be established and compared against for every metric
Source
git clone https://github.com/akaszubski/autonomous-dev/blob/master/plugins/autonomous-dev/skills/scientific-validation/SKILL.mdView on GitHub Overview
The Scientific Validation Skill provides a rigorous framework for validating claims from books, papers, or theories. It enforces pre-registration, power analysis, statistical rigor, Bayesian methods, sensitivity analysis, bias prevention, and evidence-based acceptance or rejection.
How This Skill Works
Follow the defined phases from claim verification to documentation, ensuring hypotheses are preregistered before results, performing power analysis to determine required sample size, applying bias controls, using Bayesian methods when results are ambiguous, and validating findings across multiple sources with a complete audit trail.
When to Use It
- Validating claims from books, papers, or expert sources
- Testing rules, strategies, or hypotheses with empirical evidence
- Conducting experiments or backtests with pre-registered hypotheses
- Assessing time-series or sequential claims that require walk-forward analysis
- Reaching evidence-based decisions via multi-source validation and sensitivity checks
Quick Start
- Step 1: Extract the exact claims from the source and cite them clearly
- Step 2: Pre-register hypotheses and perform a power analysis to determine required sample size
- Step 3: Run experiments/backtests, apply sensitivity analysis and Bayesian checks, then classify results
Best Practices
- Document source claims with citations and exact wording
- Pre-register hypotheses and success criteria before accessing results
- Run power analyses to determine adequate sample size (n) before testing
- Incorporate bias prevention: look-ahead, survivorship, and selection effects
- Maintain a complete audit trail and report negative results openly
Example Use Cases
- Backtest a trading rule with preregistered hypotheses and a walk-forward analysis
- Evaluate a published bias-correction method across multiple datasets
- Replicate a claimed performance improvement in 3+ independent contexts
- Use Bayes Factors to resolve ambiguous results in a validation study
- Document negative results to curb publication bias and strengthen conclusions