# ml-validate

```bash
npx machina-cli add skill nishide-dev/claude-code-ml-research/ml-validate --openclaw
```

## ML Project Validation

Comprehensive validation of ML project structure, configurations, code quality, and training readiness.

## Quick Start
```bash
# Run full validation
python scripts/validate_project.py

# Quick config check
python src/train.py --cfg job

# Fast dev run (1 batch each of train/val/test)
python src/train.py trainer.fast_dev_run=true
```
## Validation Checks

### 1. Project Structure

Required directories:

- `src/` - Source code
- `src/models/` - Model implementations
- `src/data/` - DataModule implementations
- `configs/` - Hydra configuration files
- `tests/` - Unit tests (recommended)

Required files:

- `src/train.py` - Training script
- `configs/config.yaml` - Main config
- `pyproject.toml` or `pixi.toml` - Package manager manifest
Check manually:

```bash
# Verify structure
test -d src && test -d configs && echo "✓ Basic structure OK"
test -f src/train.py && echo "✓ Training script found"
test -f configs/config.yaml && echo "✓ Main config found"
```
### 2. Configuration Validation

YAML syntax:

```bash
# Validate all YAML files
python -c "
import yaml
from pathlib import Path

for yaml_file in Path('configs').rglob('*.yaml'):
    try:
        yaml.safe_load(yaml_file.read_text())
        print(f'✓ {yaml_file}')
    except yaml.YAMLError as e:
        print(f'❌ {yaml_file}: {e}')
"
```
Config composition:

```bash
# Test Hydra config loads correctly
python src/train.py --cfg job
```

`_target_` validation:

- All `_target_` paths must be importable
- Check model, data, trainer, and logger targets
- Verify there are no typos in module paths

Use `scripts/validate_project.py` for automated checking.
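The importability check itself is short enough to sketch here. This is a minimal illustration, not the real script; `src.models.my_model.MyModel` is a hypothetical example target:

```python
import importlib

def check_target(target: str) -> bool:
    """Return True if a dotted _target_ path resolves to an importable attribute."""
    module_path, _, attr = target.rpartition(".")
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        return False
    return hasattr(module, attr)

# Replace with the _target_ values from your configs
targets = [
    "torch.nn.Linear",
    "src.models.my_model.MyModel",  # hypothetical project target
]
for target in targets:
    print(f"{'✓' if check_target(target) else '❌'} {target}")
```

A failed import here usually means either a typo in the config or a package missing from the environment.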
### 3. Code Quality

Linting:

```bash
# Ruff checks
ruff check src/ tests/

# Auto-fix issues
ruff check --fix src/ tests/
```

Type checking:

```bash
# ty (type checker)
ty check src/

# mypy (alternative)
mypy src/ --ignore-missing-imports
```
Import validation:

```python
# Check all files have valid Python syntax
import ast
from pathlib import Path

for py_file in Path("src").rglob("*.py"):
    try:
        ast.parse(py_file.read_text())
        print(f"✓ {py_file}")
    except SyntaxError as e:
        print(f"❌ {py_file}: {e}")
```
### 4. Dependencies

Required packages:

- `torch` - PyTorch
- `pytorch_lightning` - Lightning framework
- `hydra-core` - Configuration management

Optional but recommended:

- `wandb` - Experiment tracking
- `tensorboard` - Visualization
- `torch_geometric` - For GNNs
- `transformers` - For NLP
Check installation:

```bash
python -c "
import torch
import pytorch_lightning
import hydra

print(f'PyTorch: {torch.__version__}')
print(f'Lightning: {pytorch_lightning.__version__}')
print(f'Hydra: {hydra.__version__}')
"
```
GPU availability:

```bash
python -c "
import torch

print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA version: {torch.version.cuda}')
    print(f'GPU count: {torch.cuda.device_count()}')
    for i in range(torch.cuda.device_count()):
        print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
"
```
### 5. Data Pipeline

DataModule instantiation:

```python
from pathlib import Path

from hydra import compose, initialize_config_dir
from hydra.utils import instantiate

# Load config
config_dir = Path.cwd() / "configs"
with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
    cfg = compose(config_name="config")

# Instantiate DataModule
dm = instantiate(cfg.data)
print(f"✓ DataModule: {type(dm).__name__}")

# Test setup
dm.setup("fit")
print("✓ DataModule.setup() successful")

# Check dataloaders
train_loader = dm.train_dataloader()
print(f"✓ Train batches: {len(train_loader)}")
```
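Beyond counting batches, it helps to pull one batch and inspect its structure. A stdlib-only sketch follows; with a real DataModule you would pass `next(iter(train_loader))`, and the nested-list stand-in at the bottom exists only so the sketch runs anywhere:

```python
def describe_batch(obj, indent=0):
    """Recursively print a batch's structure: shapes for tensor-like objects, types otherwise."""
    pad = " " * indent
    if hasattr(obj, "shape"):  # torch.Tensor, np.ndarray, ...
        print(f"{pad}{type(obj).__name__} shape={tuple(obj.shape)}")
    elif isinstance(obj, dict):
        for key, value in obj.items():
            print(f"{pad}{key}:")
            describe_batch(value, indent + 2)
    elif isinstance(obj, (list, tuple)) and all(isinstance(v, (int, float)) for v in obj):
        print(f"{pad}{type(obj).__name__} of {len(obj)} scalars")
    elif isinstance(obj, (list, tuple)):
        print(f"{pad}{type(obj).__name__} of {len(obj)} items:")
        for value in obj:
            describe_batch(value, indent + 2)
    else:
        print(f"{pad}{type(obj).__name__}")

# In practice: describe_batch(next(iter(train_loader)))
describe_batch(([[0.0] * 4, [0.0] * 4], [1, 0]))  # stand-in (x, y) batch
```

Mismatched shapes here are much cheaper to find than inside a training step.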
Data directory:

```bash
# Verify data path exists
python -c "
from omegaconf import OmegaConf
from pathlib import Path

cfg = OmegaConf.load('configs/config.yaml')
data_dir = Path(cfg.data.data_dir)
if data_dir.exists():
    print(f'✓ Data directory: {data_dir}')
    print(f'  Files: {len(list(data_dir.rglob(\"*\")))}')
else:
    print(f'⚠️ Data directory not found: {data_dir}')
"
```
### 6. Model Validation

Model instantiation:

```python
from pathlib import Path

from hydra import compose, initialize_config_dir
from hydra.utils import instantiate

# Load config
config_dir = Path.cwd() / "configs"
with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
    cfg = compose(config_name="config")

# Instantiate model
model = instantiate(cfg.model)
print(f"✓ Model: {type(model).__name__}")

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"  Total params: {total_params:,}")
print(f"  Trainable: {trainable_params:,}")
```
Forward pass test:

```python
import torch

# Create dummy input (adjust for your model)
batch_size = 2
dummy_input = torch.randn(batch_size, 3, 224, 224)

# Test forward pass
model.eval()
with torch.no_grad():
    output = model(dummy_input)
print("✓ Forward pass OK")
print(f"  Input: {dummy_input.shape}")
print(f"  Output: {output.shape}")
```
### 7. Training Readiness

Fast dev run:

```bash
# Run 1 batch each of train/val/test
python src/train.py trainer.fast_dev_run=true

# Expected output:
# - No errors
# - Completes in under a minute
# - Shows train/val/test progress
```
Logger check:

```python
import os
from pathlib import Path

from hydra import compose, initialize_config_dir

config_dir = Path.cwd() / "configs"
with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
    cfg = compose(config_name="config")

if "logger" in cfg:
    print(f"✓ Logger: {cfg.logger.get('_target_', 'unknown')}")
    # Check W&B credentials if using wandb
    if "wandb" in str(cfg.logger.get("_target_", "")):
        if "WANDB_API_KEY" in os.environ:
            print("✓ W&B API key set")
        else:
            print("⚠️ W&B not logged in (run: wandb login)")
```
## Validation Script

Use the automated validation script:

```bash
python scripts/validate_project.py
```

What it checks:

- ✓ Project structure (directories & files)
- ✓ Config YAML syntax
- ✓ Config composition
- ✓ `_target_` paths are importable
- ✓ Code quality (ruff)
- ✓ Dependencies installed
- ✓ GPU availability
- ✓ Model instantiation
- ✓ DataModule instantiation
- ✓ Fast dev run
Example output:

```
INFO: Starting ML project validation...
INFO: ✓ Project structure valid
INFO: ✓ All configs valid
INFO: ✓ Code quality OK
INFO: ✓ All dependencies installed
INFO: ✓ Model instantiated successfully
INFO: ✓ DataModule instantiated successfully
INFO: ✓ Fast dev run completed
INFO: ✓ All validation checks passed!
```

See `scripts/validate_project.py` for the implementation.
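The script itself is project-specific, but its skeleton might look something like this. A minimal sketch with only two checks wired up; `check_structure` and `check_lint` are illustrative, not the real implementation:

```python
import subprocess
from pathlib import Path

def check_structure() -> bool:
    """Required directories and files exist."""
    required = ["src", "configs", "src/train.py", "configs/config.yaml"]
    return all(Path(p).exists() for p in required)

def check_lint() -> bool:
    """ruff passes on src/ (assumes ruff is on PATH)."""
    return subprocess.run(["ruff", "check", "src/"], capture_output=True).returncode == 0

def run_checks(checks) -> int:
    """Run (name, fn) pairs, print one line per check, and return the failure count."""
    failed = 0
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as a failure
            ok = False
            print(f"❌ {name}: {exc}")
        else:
            print(f"{'✓' if ok else '❌'} {name}")
        failed += not ok
    return failed

# Entry point: exit non-zero if anything failed, so CI can gate on it
# raise SystemExit(run_checks([("Project structure", check_structure),
#                              ("Code quality (ruff)", check_lint)]))
```

Returning a failure count rather than stopping at the first error lets one run surface every problem at once.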
## Quick Checks

### One-line Validation

```bash
# Config only
python src/train.py --cfg job && echo "✓ Config OK"

# Full validation
python scripts/validate_project.py && echo "✓ All OK"
```
Pre-Training Checklist
# 1. Structure
test -d src -a -d configs -a -f src/train.py && echo "✓ Structure"
# 2. Config
python src/train.py --cfg job && echo "✓ Config"
# 3. Dependencies
python -c "import torch, pytorch_lightning, hydra" && echo "✓ Deps"
# 4. GPU
python -c "import torch; assert torch.cuda.is_available()" && echo "✓ GPU"
# 5. Fast dev run
python src/train.py trainer.fast_dev_run=true && echo "✓ Training"
## CI/CD Integration

Add to `.github/workflows/validate.yml`:

```yaml
name: Validate ML Project

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - uses: astral-sh/setup-uv@v5  # uv is not preinstalled on hosted runners
      - name: Install dependencies
        run: uv sync --all-extras
      - name: Validate project
        run: uv run python scripts/validate_project.py
      - name: Test config
        run: uv run python src/train.py --cfg job
      - name: Fast dev run
        run: uv run python src/train.py trainer.fast_dev_run=true
```
## Common Issues

### "Config composition failed"

Cause: Typo in `defaults` or invalid YAML.

Fix:

```bash
# Check YAML syntax
python -c "import yaml; yaml.safe_load(open('configs/config.yaml'))"

# Check that the defaults exist
ls configs/model/ configs/data/ configs/trainer/
```

### "`_target_` not found"

Cause: Incorrect module path, or the package is not installed.

Fix:

```bash
# Check that the import works
python -c "from src.models.my_model import MyModel"

# Verify the path in the config matches the file structure
```
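One quick sanity check for that path-matching step: derive the file a `_target_` should live in and confirm it exists. A sketch, assuming the class is defined in a plain module rather than re-exported from an `__init__.py`; the target below is a hypothetical example:

```python
from pathlib import Path

def expected_file(target: str) -> Path:
    """Map a dotted _target_ (module.Class) to the .py file that should define it."""
    module_path = target.rsplit(".", 1)[0]  # drop the class name
    return Path(*module_path.split(".")).with_suffix(".py")

target = "src.models.my_model.MyModel"  # hypothetical example
path = expected_file(target)
print(f"{target} -> {path} (exists: {path.exists()})")
```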
### "DataModule setup failed"

Cause: Data directory missing or incorrect path.

Fix:

```bash
# Check the data path in the config
grep data_dir configs/data/*.yaml

# Create the data directory
mkdir -p data/
```
### "Fast dev run failed"

Cause: Varies; the error can be anywhere in the training loop.

Fix:

```bash
# Run with verbose Hydra logging
python src/train.py trainer.fast_dev_run=true hydra.verbose=true

# Check the logs for the specific error
```
## Success Criteria

- Project structure valid
- All YAML files valid
- Config composes without errors
- All `_target_` paths importable
- Code passes linting
- Required deps installed
- GPU available (if needed)
- Model instantiates
- DataModule instantiates
- Fast dev run succeeds
- Logger configured

✅ Project is ready for training!
## Source

```bash
git clone https://github.com/nishide-dev/claude-code-ml-research
```

[View on GitHub](https://github.com/nishide-dev/claude-code-ml-research/blob/main/skills/ml-validate/SKILL.md)

## Overview
ml-validate provides a comprehensive suite for auditing ML projects. It checks project structure, configuration loading, data pipelines, model architecture, and dependencies to ensure training readiness and easier debugging.

## How This Skill Works

It runs a targeted validation suite across seven areas: project structure, configuration, code quality, dependencies, data pipeline, model validation, and training readiness. Each step verifies presence, correctness, and importability, then reports the issues to fix before training.
## When to Use It

- Starting a new ML project, to verify required folders, files, and tooling.
- Before any training run, to catch config, dependency, or environment problems.
- While debugging Hydra configs or `_target_` imports, to resolve setup errors.
- After adding or changing the data pipeline or model modules, to ensure integration.
- In CI or local environments, to confirm dependencies and GPU availability.
## Quick Start

- Step 1: Run full validation: `python scripts/validate_project.py`
- Step 2: Quick config check: `python src/train.py --cfg job`
- Step 3: Fast dev run: `python src/train.py trainer.fast_dev_run=true`
## Best Practices

- Run the full validation script (`validate_project.py`) first.
- Validate all YAML configs and confirm Hydra composes them with `--cfg job`.
- Run linting (ruff) and type checks (ty or mypy) and fix any reported errors.
- Verify dependencies and GPU availability in the target environment.
- Test the DataModule setup and dataloaders with a fast dev run.
## Example Use Cases

- Kick off a new project by validating the basic structure and core scripts.
- Before training, ensure configs, targets, and models are wired correctly.
- Quickly diagnose CI failures caused by YAML or import errors with config validation.
- Surface dependency version mismatches with the installation checks.
- Confirm the DataModule's setup and dataloaders run without errors.