Quality Scoring
```bash
npx machina-cli add skill akaszubski/autonomous-dev/quality-scoring --openclaw
```
Multi-dimensional assessment for training data quality.
When It Activates
Quality assessment, data scoring, multi-dimensional evaluation, IFD scoring, factuality checks, reasoning validation, training data prep
Core Concepts
Quality Scorers (6 Types)
Fast to comprehensive scoring approaches:
- FastIFD - Instruction-following difficulty (10-20x faster)
- Quality - LLM-based quality (Qwen3-30B, 0.85 examples/s)
- MultiDimensional - 5-dimension composite
- LLMQuality - Multi-backend (MLX/OpenRouter)
- Ensemble - Cross-model ensemble
- Tulu3 - Multi-dimensional reference (training_metrics.py)
Quality Dimensions (6 Metrics)
- IFD Score (0.0-1.0) - Instruction-following difficulty
- Factuality (0.0-1.0) - Hallucination detection
- Reasoning (0.0-1.0) - Step-by-step logic quality
- Diversity (0.0-1.0) - Dataset-level diversity
- Domain (0.0-1.0) - Domain-specific relevance
- LLM Quality (1-10) - Tulu3 comprehensive score
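How the six dimensions fold into one number is scorer-specific; a minimal composite sketch, assuming equal weights (illustrative, not the library's actual weighting) and rescaling LLM Quality from 1-10 to 0-1:

```python
def composite_score(dims: dict[str, float]) -> float:
    """Combine the six dimensions into one 0.0-1.0 composite.

    Expects keys: ifd, factuality, reasoning, diversity, domain, llm_quality.
    llm_quality arrives on a 1-10 scale and is rescaled to 0-1 first;
    equal weights are an assumption, not the library's actual weighting.
    """
    normalized = dict(dims)
    normalized["llm_quality"] = (dims["llm_quality"] - 1) / 9  # 1-10 -> 0-1
    return sum(normalized.values()) / len(normalized)          # equal weights

# Example: a strong sample scores high on every dimension
print(composite_score({
    "ifd": 0.6, "factuality": 0.95, "reasoning": 0.9,
    "diversity": 0.7, "domain": 0.85, "llm_quality": 9.0,
}))  # ~0.81
```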
Training Thresholds
| Type | Quality | IFD | Use Case |
|---|---|---|---|
| SFT | ≥8.0 | ≥0.3 | Base training |
| DPO chosen | ≥9.0 | ≥0.5 | High quality only |
| DPO rejected | ≤6.0 | any | Low quality |
| RLVR | ≥9.0 | ≥0.5 | Verified solutions |
| Calibration | ≥8.0 | ≥0.4 | Uncertainty examples |
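Applied as a gate, the table translates directly into a filter. A sketch, assuming each sample already carries `quality` (1-10) and `ifd` (0.0-1.0) fields:

```python
# Gates from the table above; "DPO rejected" inverts the comparison
THRESHOLDS = {
    "sft":         {"quality": 8.0, "ifd": 0.3},
    "dpo_chosen":  {"quality": 9.0, "ifd": 0.5},
    "rlvr":        {"quality": 9.0, "ifd": 0.5},
    "calibration": {"quality": 8.0, "ifd": 0.4},
}

def passes(sample: dict, training_type: str) -> bool:
    if training_type == "dpo_rejected":  # low quality wanted, any IFD
        return sample["quality"] <= 6.0
    gate = THRESHOLDS[training_type]
    return sample["quality"] >= gate["quality"] and sample["ifd"] >= gate["ifd"]

# scored_samples would come from a scoring pass; two toy records for illustration
scored_samples = [{"quality": 9.1, "ifd": 0.55}, {"quality": 7.2, "ifd": 0.4}]
sft_data = [s for s in scored_samples if passes(s, "sft")]  # keeps only the first
```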
Quick Reference
| Concept | Details | Reference |
|---|---|---|
| Scorers | 6 types (FastIFD to Ensemble) | quality-scorers.md |
| Dimensions | 6 metrics (IFD to LLM Quality) | quality-dimensions.md |
| Thresholds | By training type (SFT, DPO, RLVR) | training-thresholds.md |
| Library | training_metrics.py | Integration functions |
IFD Score Calculation
```python
from training_metrics import calculate_ifd_score

# IFD = PPL(response | instruction) / PPL(response)
ifd_score = calculate_ifd_score(
    instruction="Explain quantum computing",
    response="Quantum computing uses qubits...",
)
# Higher score = more challenging (the instruction helps less)
```
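Under the hood the score is a perplexity ratio. A worked sketch of the formula, assuming per-token log-probabilities from any language model (the library's internals may differ):

```python
import math

def ifd_from_logprobs(cond_logprobs: list[float], resp_logprobs: list[float]) -> float:
    """IFD = PPL(response | instruction) / PPL(response).

    cond_logprobs: per-token log-probs of the response given the instruction
    resp_logprobs: per-token log-probs of the response scored alone
    (hypothetical inputs; any LM that returns token log-probs works)
    """
    def ppl(logprobs: list[float]) -> float:
        return math.exp(-sum(logprobs) / len(logprobs))
    return ppl(cond_logprobs) / ppl(resp_logprobs)

# A helpful instruction lowers conditional perplexity, so easy samples land
# near 0 while hard ones (where the instruction barely helps) approach 1
print(ifd_from_logprobs([-1.2, -0.9, -1.5], [-2.1, -1.8, -2.4]))  # ~0.41
```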
DPO Pair Validation
```python
from training_metrics import validate_dpo_pairs

# Validate the chosen/rejected quality gap
is_valid = validate_dpo_pairs(
    chosen_score=9.2,   # high quality
    rejected_score=5.8, # low quality
)
# Ensures quality gap ≥0.15
```
REQUIRED: DPO Multi-Dimensional Scoring
Every DPO pair MUST have multi-dimensional quality scores before training.
This is a hard requirement — DPO data without quality scores will learn shortcuts (e.g., "longer = better") instead of genuine preference signal.
Required output fields per pair:
- chosen_score (float): Composite quality score for the chosen response
- rejected_score (float): Composite quality score for the rejected response
- margin (float): chosen_score - rejected_score (must be ≥3.0)
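A minimal sketch of one conforming JSONL record; only the three score fields are mandated above, the payload fields are illustrative:

```python
import json

pair = {
    "chosen": "Detailed, grounded answer...",   # illustrative payload fields
    "rejected": "Vague answer...",
    "chosen_score": 9.2,
    "rejected_score": 5.8,
    "margin": 9.2 - 5.8,  # 3.4, must be >= 3.0
}
assert pair["margin"] >= 3.0, "pair fails the DPO margin gate"
print(json.dumps(pair))  # one record per line in the output JSONL
```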
Length bias audit (MUST run before DPO training):
```python
from pathlib import Path
from training_metrics import validate_dpo_pairs

metrics = validate_dpo_pairs(dpo_path=Path("dpo_pairs.jsonl"))

# Check length bias
longer_chosen = sum(1 for p in metrics.pairs if len(p.chosen) > len(p.rejected))
length_bias = longer_chosen / metrics.total_pairs
if length_bias > 0.70:
    raise ValueError(
        f"DPO length bias {length_bias:.0%} > 70% threshold.\n"
        f"Model will learn 'longer = better' shortcut.\n"
        f"Fix: Score by quality dimensions, not length."
    )

# Check quality scores present
missing = sum(1 for p in metrics.pairs if p.chosen_score is None)
if missing > 0:
    raise ValueError(f"{missing} pairs missing quality scores — run scoring first")
```
Scoring workflow (sketched after this list):
1. Generate DPO pairs (dpo-rlvr-generation skill)
2. Score all pairs with a multi-dimensional scorer (this skill)
3. Filter by quality margin ≥3.0
4. Audit length bias ≤70%
5. Only then proceed to training
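A minimal sketch of steps 2-4 as a single gate; `score_fn` is a stand-in for whichever scorer you configured, not a training_metrics function:

```python
# Steps 2-4 as one gate; generation (step 1) and training (step 5) live in
# other skills. score_fn stands in for whichever scorer you configured.
def gate_dpo(pairs: list[dict], score_fn) -> list[dict]:
    kept = []
    for p in pairs:
        p["chosen_score"] = score_fn(p["chosen"])
        p["rejected_score"] = score_fn(p["rejected"])
        p["margin"] = p["chosen_score"] - p["rejected_score"]
        if p["margin"] >= 3.0:  # step 3: quality margin
            kept.append(p)
    longer = sum(1 for p in kept if len(p["chosen"]) > len(p["rejected"]))
    if kept and longer / len(kept) > 0.70:  # step 4: length-bias audit
        raise ValueError("length bias above 70% -- rescore by quality, not length")
    return kept
```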
RLVR Verifiability
```python
from training_metrics import assess_rlvr_verifiability

# Assess reasoning-trace verifiability
verifiable = assess_rlvr_verifiability(
    reasoning_trace="Step 1: ...\nStep 2: ...",
    domain="math",
)
# Math/coding: 90%+ verifiable required
```
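To gate a whole trace file on the 90% (math/coding) and 80% (general) bars noted in this skill, a sketch like this works; the `trace` field name is illustrative:

```python
import json
from pathlib import Path
from training_metrics import assess_rlvr_verifiability

# Domain bars from this skill's notes: math/coding 90%+, general 80%+
VERIFIABILITY_BARS = {"math": 0.9, "coding": 0.9, "general": 0.8}

def verified_fraction(traces: list[dict], domain: str) -> float:
    ok = sum(
        1 for t in traces
        if assess_rlvr_verifiability(reasoning_trace=t["trace"], domain=domain)
    )
    return ok / len(traces)

traces = [json.loads(l) for l in Path("rlvr_traces.jsonl").read_text().splitlines()]
if verified_fraction(traces, "math") < VERIFIABILITY_BARS["math"]:
    raise ValueError("math traces below the 90% verifiability bar")
```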
Progressive Disclosure
Detailed guides: See docs/*.md
- docs/quality-scorers.md - 6 scorer implementations
- docs/quality-dimensions.md - 6 dimension definitions
- docs/training-thresholds.md - Thresholds, CLI, distributed performance
Security Considerations
Input Validation (CWE-20)
- Validate score ranges (0.0-1.0 or 1-10)
- Sanitize data inputs before scoring
- Check threshold values before application
Path Traversal (CWE-22)
- Sanitize file paths for data loading
- Whitelist directories for training data
- Validate output paths for scored datasets
Security Patterns (training_metrics.py)
```python
import json
from pathlib import Path

ALLOWED_DIR = Path("/allowed/data")

def safe_load_data(data_path: str) -> dict:
    """Load data with path validation."""
    # Resolve symlinks and ".." first, then confirm the result stays inside
    # the allowed directory (string-prefix checks can be bypassed by
    # sibling paths like /allowed/data-evil)
    path = Path(data_path).resolve()
    if not path.is_relative_to(ALLOWED_DIR):
        raise ValueError(f"Path outside allowed directory: {path}")
    # Load safely
    return json.loads(path.read_text())
```
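The input-validation half (CWE-20) can be a plain range check before any threshold is applied; a sketch covering both scales used by this skill (the helper name is hypothetical):

```python
def validate_score(value: float, scale: str = "unit") -> float:
    """Reject out-of-range scores before thresholds are applied (CWE-20).

    scale="unit" covers the 0.0-1.0 dimensions; scale="ten" covers the
    1-10 LLM Quality score. The helper name is illustrative.
    """
    lo, hi = (0.0, 1.0) if scale == "unit" else (1.0, 10.0)
    if not isinstance(value, (int, float)) or not lo <= value <= hi:
        raise ValueError(f"score {value!r} outside [{lo}, {hi}]")
    return float(value)

validate_score(0.42)              # IFD/factuality/etc.
validate_score(9.2, scale="ten")  # LLM Quality
```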
Distributed Performance
Single Machine Performance
- M4 Max: ~0.85 ex/s (Qwen3-30B)
- M3 Ultra: ~0.85 ex/s (Qwen3-30B)
Parallel Processing
- Combined throughput: ~1.7 ex/s with two machines (50/50 split)
- Scaling: Linear with machine count
- Bottleneck: Model inference, not I/O
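Since inference is the bottleneck, a plain file split is enough to approach the combined figure; a sharding sketch (paths and machine count are illustrative):

```python
from pathlib import Path

def shard_jsonl(src: Path, n_machines: int = 2) -> list[Path]:
    """Round-robin split so each machine scores an equal share."""
    lines = src.read_text().splitlines()
    shards = []
    for i in range(n_machines):
        out = src.with_name(f"{src.stem}.shard{i}.jsonl")
        out.write_text("\n".join(lines[i::n_machines]) + "\n")
        shards.append(out)
    return shards

shard_jsonl(Path("data/train.jsonl"))  # then run the CLI below on each shard
```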
CLI Commands
```bash
# Score dataset with FastIFD
python -m training_metrics score \
  --input data/train.jsonl \
  --output data/scored.jsonl \
  --scorer fastifd \
  --threshold 0.3

# Multi-dimensional scoring
python -m training_metrics score \
  --input data/train.jsonl \
  --output data/scored.jsonl \
  --scorer multidim \
  --quality-threshold 8.0 \
  --ifd-threshold 0.5

# DPO pair filtering
python -m training_metrics filter_dpo \
  --input data/dpo_pairs.jsonl \
  --output data/filtered_pairs.jsonl \
  --chosen-threshold 9.0 \
  --rejected-threshold 6.0

# RLVR verifiability check
python -m training_metrics assess_rlvr \
  --input data/rlvr_traces.jsonl \
  --output data/verified.jsonl \
  --domain math \
  --threshold 0.9
```
Related Skills
- data-distillation - IFD methodology and KenLM filtering
- preference-data-quality - DPO and RLVR metrics
- python-standards - Code quality standards
Library Integration
Primary library: training_metrics.py
Key functions:
- calculate_ifd_score() - IFD calculation
- validate_dpo_pairs() - DPO pair validation
- assess_rlvr_verifiability() - RLVR assessment
- score_quality() - Multi-dimensional scoring
- ensemble_score() - Cross-model ensemble
Key Takeaways
- 6 scorers - FastIFD (fast) to Ensemble (comprehensive)
- 6 dimensions - IFD, Factuality, Reasoning, Diversity, Domain, LLM Quality
- Training thresholds - SFT ≥8.0, DPO chosen ≥9.0, RLVR ≥9.0
- IFD score - PPL(response|instruction) / PPL(response), higher = harder
- Security - CWE-20 (input validation), CWE-22 (path traversal)
- Distributed - ~1.7 ex/s with 2 machines (linear scaling)
- CLI commands - training_metrics module for all operations
- Integration - Use training_metrics library functions
- DPO pairs - Chosen ≥9.0, Rejected ≤6.0, gap ≥0.15
- RLVR - Math/coding 90%+ verifiable, general 80%+
- DPO scoring REQUIRED - Every pair must have chosen_score, rejected_score, margin before training
- Length bias audit - ≤70% of pairs where chosen is longer (prevents "longer = better" shortcut)
Source
```bash
git clone https://github.com/akaszubski/autonomous-dev
# skill file: plugins/autonomous-dev/skills/quality-scoring/SKILL.md
```
Overview
Quality Scoring evaluates training data with six scorers across six metrics to quantify data quality. It enables objective gating for SFT, DPO, RLVR, and calibration workflows, focusing on instruction-following difficulty, factuality, reasoning, diversity, domain relevance, and overall LLM quality.
How This Skill Works
Six scorers (FastIFD, Quality, MultiDimensional, LLMQuality, Ensemble, Tulu3) produce per-sample scores across six dimensions: IFD, Factuality, Reasoning, Diversity, Domain, and LLM Quality. Scores are combined into composite metrics and required outputs for DPO (chosen_score, rejected_score, margin). The workflow relies on the training_metrics.py library with examples like calculate_ifd_score and validate_dpo_pairs to gate training data quality.
When to Use It
- When preparing SFT, DPO, or RLVR training data and you need robust quality gating.
- When you must gate on instruction-following difficulty, factuality, and reasoning quality in your data.
- When validating DPO pairs with multi-dimensional scores before training.
- When calibrating datasets and handling uncertainty examples.
- When auditing for length bias and ensuring quality isn't traded for longer responses.
Quick Start
- Step 1: Identify the training type and select the multi-dimensional scorer set (FastIFD, Quality, MultiDimensional, LLMQuality, Ensemble, Tulu3).
- Step 2: Compute per-sample scores across IFD, Factuality, Reasoning, Diversity, Domain, and LLM Quality using training_metrics.py (e.g., calculate_ifd_score).
- Step 3: For DPO, ensure each pair has chosen_score, rejected_score, and margin (≥3.0); run length-bias audits and proceed with training if thresholds are met.
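Those steps condensed into one sketch, using the training_metrics functions shown earlier; the JSONL field names are illustrative:

```python
import json
from pathlib import Path
from training_metrics import calculate_ifd_score, validate_dpo_pairs

# Step 2: per-sample IFD scores (field names are illustrative)
samples = [json.loads(l) for l in Path("data/train.jsonl").read_text().splitlines()]
for s in samples:
    s["ifd_score"] = calculate_ifd_score(
        instruction=s["instruction"], response=s["response"]
    )

# Step 3: DPO gate -- margin and length-bias checks as in the audit above
metrics = validate_dpo_pairs(dpo_path=Path("dpo_pairs.jsonl"))
```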
Best Practices
- Use all six scorers and six dimensions to avoid blind spots.
- Align scoring thresholds with the training type (SFT, DPO chosen/rejected, RLVR, Calibration).
- Require multi-dimensional outputs for every DPO pair: chosen_score, rejected_score, and margin (≥3.0).
- Run a length-bias audit before training to prevent 'longer = better' shortcuts.
- Leverage training_metrics.py for IFD, DPO validation, and workflow integration.
Example Use Cases
- Scoring a DPO data pair with chosen_score 9.2, rejected_score 5.8, and margin 3.4 to ensure a quality gap.
- Calculating IFD Score for an instruction like 'Explain quantum computing' to assess difficulty.
- Using RLVR-style verified solutions and multi-dimensional scoring to validate reasoning traces.
- Calibrating datasets by checking uncertainty examples and achieving Calibration thresholds.
- Integrating the scoring workflow into a DPO data generation pipeline (generate pairs, score, gate).