
Quality Scoring

npx machina-cli add skill akaszubski/autonomous-dev/quality-scoring --openclaw

Multi-dimensional assessment for training data quality.

When Activates

Quality assessment, data scoring, multi-dimensional evaluation, IFD scoring, factuality checks, reasoning validation, training data prep


Core Concepts

Quality Scorers (6 Types)

Fast to comprehensive scoring approaches:

  1. FastIFD - Instruction-following difficulty (10-20x faster)
  2. Quality - LLM-based quality (Qwen3-30B, 0.85 ex/s)
  3. MultiDimensional - 5-dimension composite
  4. LLMQuality - Multi-backend (MLX/OpenRouter)
  5. Ensemble - Cross-model ensemble
  6. Tulu3 - Multi-dimensional reference (training_metrics.py)

Quality Dimensions (6 Metrics)

  1. IFD Score (0.0-1.0) - Instruction-following difficulty
  2. Factuality (0.0-1.0) - Hallucination detection
  3. Reasoning (0.0-1.0) - Step-by-step logic quality
  4. Diversity (0.0-1.0) - Dataset-level diversity
  5. Domain (0.0-1.0) - Domain-specific relevance
  6. LLM Quality (1-10) - Tulu3 comprehensive score
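The six dimensions can be collapsed into a single composite for gating. A minimal sketch of one way to do that; the equal weights and the rescaling of the 1-10 LLM Quality score are illustrative assumptions, not the library's actual formula:

```python
def composite_score(ifd: float, factuality: float, reasoning: float,
                    diversity: float, domain: float, llm_quality: float) -> float:
    """Combine the six dimensions into one 0-10 composite.

    Assumption: equal weights; the five 0.0-1.0 metrics are rescaled
    to 0-10 before averaging with the 1-10 LLM Quality score.
    """
    rescaled = [m * 10 for m in (ifd, factuality, reasoning, diversity, domain)]
    return sum(rescaled + [llm_quality]) / 6

# A uniformly strong sample scores high on the composite
score = composite_score(0.6, 0.95, 0.9, 0.8, 0.85, 9.0)
```

In practice you would tune the weights per training type rather than average uniformly.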

Training Thresholds

| Type | Quality | IFD | Use Case |
|------|---------|-----|----------|
| SFT | ≥8.0 | ≥0.3 | Base training |
| DPO chosen | ≥9.0 | ≥0.5 | High quality only |
| DPO rejected | ≤6.0 | any | Low quality |
| RLVR | ≥9.0 | ≥0.5 | Verified solutions |
| Calibration | ≥8.0 | ≥0.4 | Uncertainty examples |
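The threshold table maps directly to a gating function. A sketch with the values hard-coded from the table; the function name and dict layout are my own, not part of training_metrics.py:

```python
# (quality_min, ifd_min) per training type; DPO rejected caps quality instead.
THRESHOLDS = {
    "sft":         (8.0, 0.3),
    "dpo_chosen":  (9.0, 0.5),
    "rlvr":        (9.0, 0.5),
    "calibration": (8.0, 0.4),
}

def passes(training_type: str, quality: float, ifd: float) -> bool:
    """Return True if a sample clears the thresholds for its training type."""
    if training_type == "dpo_rejected":
        return quality <= 6.0          # any IFD accepted for rejected samples
    q_min, ifd_min = THRESHOLDS[training_type]
    return quality >= q_min and ifd >= ifd_min
```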

Quick Reference

| Concept | Details | Reference |
|---------|---------|-----------|
| Scorers | 6 types (FastIFD to Ensemble) | quality-scorers.md |
| Dimensions | 6 metrics (IFD to LLM Quality) | quality-dimensions.md |
| Thresholds | By training type (SFT, DPO, RLVR) | training-thresholds.md |
| Library | Integration functions | training_metrics.py |

IFD Score Calculation

from training_metrics import calculate_ifd_score

# IFD = PPL(response) / PPL(response|instruction)
ifd_score = calculate_ifd_score(
    instruction="Explain quantum computing",
    response="Quantum computing uses qubits..."
)
# Higher score = more challenging
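The ratio itself is trivial once you have the two perplexities. A sketch of just the division step, assuming you obtain the PPL values from your own model; this helper is illustrative, not a training_metrics function:

```python
def ifd_from_perplexities(ppl_response: float, ppl_conditioned: float) -> float:
    """IFD = PPL(response) / PPL(response | instruction).

    Scores near 1.0 mean the instruction barely reduces the response's
    perplexity (a harder, more informative sample); low scores mean the
    instruction makes the response easy to predict.
    """
    if ppl_conditioned <= 0:
        raise ValueError("perplexity must be positive")
    return ppl_response / ppl_conditioned
```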

DPO Pair Validation

from training_metrics import validate_dpo_pairs

# Validate chosen/rejected quality gap
is_valid = validate_dpo_pairs(
    chosen_score=9.2,  # High quality
    rejected_score=5.8  # Low quality
)
# Ensures quality gap ≥0.15

REQUIRED: DPO Multi-Dimensional Scoring

Every DPO pair MUST have multi-dimensional quality scores before training.

This is a hard requirement — DPO data without quality scores will learn shortcuts (e.g., "longer = better") instead of genuine preference signal.

Required output fields per pair:

  • chosen_score (float): Composite quality score for chosen response
  • rejected_score (float): Composite quality score for rejected response
  • margin (float): chosen_score - rejected_score (must be ≥3.0)
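A minimal per-record check for these three fields; the field names follow the list above, while the flat-dict record layout is an assumption about the JSONL format:

```python
REQUIRED = ("chosen_score", "rejected_score", "margin")

def check_pair(record: dict) -> None:
    """Raise if a DPO record is missing scores or its margin is wrong or too small."""
    missing = [k for k in REQUIRED if k not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    expected = record["chosen_score"] - record["rejected_score"]
    if abs(record["margin"] - expected) > 1e-6:
        raise ValueError("margin != chosen_score - rejected_score")
    if record["margin"] < 3.0:
        raise ValueError(f"margin {record['margin']:.2f} below 3.0 floor")
```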

Length bias audit (MUST run before DPO training):

from pathlib import Path
from training_metrics import validate_dpo_pairs

metrics = validate_dpo_pairs(dpo_path=Path("dpo_pairs.jsonl"))

# Check length bias
longer_chosen = sum(1 for p in metrics.pairs if len(p.chosen) > len(p.rejected))
length_bias = longer_chosen / metrics.total_pairs

if length_bias > 0.70:
    raise ValueError(
        f"DPO length bias {length_bias:.0%} > 70% threshold.\n"
        f"Model will learn 'longer = better' shortcut.\n"
        f"Fix: Score by quality dimensions, not length."
    )

# Check quality scores present
missing = sum(1 for p in metrics.pairs if p.chosen_score is None)
if missing > 0:
    raise ValueError(f"{missing} pairs missing quality scores — run scoring first")

Scoring workflow:

  1. Generate DPO pairs (dpo-rlvr-generation skill)
  2. Score all pairs with multi-dimensional scorer (this skill)
  3. Filter by quality margin ≥3.0
  4. Audit length bias ≤70%
  5. Only then proceed to training
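Steps 2-4 of the workflow above can be sketched as one function. Hedged: `score_pair` is a placeholder for whichever multi-dimensional scorer you use in step 2, and the pair dict layout is an assumption:

```python
def prepare_dpo_pairs(pairs, score_pair):
    """Score pairs, filter by margin, and audit length bias (workflow steps 2-4).

    `pairs` is an iterable of dicts with 'chosen'/'rejected' text;
    `score_pair` returns (chosen_score, rejected_score).
    """
    scored = []
    for p in pairs:
        c, r = score_pair(p)
        p = {**p, "chosen_score": c, "rejected_score": r, "margin": c - r}
        if p["margin"] >= 3.0:                     # step 3: quality margin gate
            scored.append(p)
    longer = sum(len(p["chosen"]) > len(p["rejected"]) for p in scored)
    if scored and longer / len(scored) > 0.70:     # step 4: length-bias audit
        raise ValueError("length bias above 70% threshold")
    return scored                                  # step 5: ready for training
```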

RLVR Verifiability

from training_metrics import assess_rlvr_verifiability

# Assess reasoning trace verifiability
verifiable = assess_rlvr_verifiability(
    reasoning_trace="Step 1: ...\nStep 2: ...",
    domain="math"
)
# Math/coding: 90%+ verifiable required
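The per-domain bars (90% for math/coding, 80% for general, per this skill's thresholds) can be expressed as a simple dataset-level gate; the dict and function name here are my own framing, not the library's API:

```python
# Minimum verifiable fraction per domain (90% math/coding, 80% general).
VERIFIABILITY_FLOORS = {"math": 0.90, "coding": 0.90, "general": 0.80}

def rlvr_gate(verifiable_fraction: float, domain: str) -> bool:
    """True if the dataset's verifiable fraction clears its domain floor."""
    return verifiable_fraction >= VERIFIABILITY_FLOORS.get(domain, 0.80)
```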

Progressive Disclosure

Detailed guides: See docs/*.md

  • docs/quality-scorers.md - 6 scorer implementations
  • docs/quality-dimensions.md - 6 dimension definitions
  • docs/training-thresholds.md - Thresholds, CLI, distributed performance

Security Considerations

Input Validation (CWE-20)

  • Validate score ranges (0.0-1.0 or 1-10)
  • Sanitize data inputs before scoring
  • Check threshold values before application
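A range check covering the two score scales used in this skill (CWE-20 style input validation); a sketch, with the function name my own:

```python
def validate_score(value, scale: str = "unit") -> float:
    """Reject out-of-range scores before they enter thresholding.

    scale="unit" covers the 0.0-1.0 metrics (IFD, factuality, ...);
    scale="ten" covers the 1-10 LLM Quality score.
    """
    lo, hi = (0.0, 1.0) if scale == "unit" else (1.0, 10.0)
    if not isinstance(value, (int, float)) or not lo <= value <= hi:
        raise ValueError(f"score {value!r} outside [{lo}, {hi}]")
    return float(value)
```

Note the comparison chain also rejects NaN, since `lo <= nan <= hi` is false.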

Path Traversal (CWE-22)

  • Sanitize file paths for data loading
  • Whitelist directories for training data
  • Validate output paths for scored datasets

Security Patterns (training_metrics.py)

import json
from pathlib import Path

def safe_load_data(data_path: str) -> dict:
    """Load data with path validation."""
    # Validate path within allowed directory
    path = Path(data_path).resolve()
    if not str(path).startswith('/allowed/data/'):
        raise ValueError(f"Path outside allowed directory: {path}")

    # Load safely
    return json.loads(path.read_text())

Distributed Performance

Single Machine Performance

  • M4 Max: ~0.85 ex/s (Qwen3-30B)
  • M3 Ultra: ~0.85 ex/s (Qwen3-30B)

Parallel Processing

  • Combined throughput: ~1.7 ex/s (50/50 split)
  • Scaling: Linear with machine count
  • Bottleneck: Model inference, not I/O
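Splitting a dataset across machines for the linear scaling above can be sketched with simple striding; the 50/50 split generalizes to N workers. This is generic sharding, not a training_metrics feature:

```python
def shard(items: list, worker: int, n_workers: int) -> list:
    """Round-robin shard: worker i scores items i, i+n, i+2n, ..."""
    return items[worker::n_workers]

# Two machines at ~0.85 ex/s each -> ~1.7 ex/s combined
data = list(range(10))
parts = [shard(data, w, 2) for w in range(2)]
```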

CLI Commands

# Score dataset with FastIFD
python -m training_metrics score \
  --input data/train.jsonl \
  --output data/scored.jsonl \
  --scorer fastifd \
  --threshold 0.3

# Multi-dimensional scoring
python -m training_metrics score \
  --input data/train.jsonl \
  --output data/scored.jsonl \
  --scorer multidim \
  --quality-threshold 8.0 \
  --ifd-threshold 0.5

# DPO pair filtering
python -m training_metrics filter_dpo \
  --input data/dpo_pairs.jsonl \
  --output data/filtered_pairs.jsonl \
  --chosen-threshold 9.0 \
  --rejected-threshold 6.0

# RLVR verifiability check
python -m training_metrics assess_rlvr \
  --input data/rlvr_traces.jsonl \
  --output data/verified.jsonl \
  --domain math \
  --threshold 0.9

Related Skills

  • data-distillation - IFD methodology and KenLM filtering
  • preference-data-quality - DPO and RLVR metrics
  • python-standards - Code quality standards

Library Integration

Primary library: training_metrics.py

Key functions:

  • calculate_ifd_score() - IFD calculation
  • validate_dpo_pairs() - DPO pair validation
  • assess_rlvr_verifiability() - RLVR assessment
  • score_quality() - Multi-dimensional scoring
  • ensemble_score() - Cross-model ensemble

Key Takeaways

  1. 6 scorers - FastIFD (fast) to Ensemble (comprehensive)
  2. 6 dimensions - IFD, Factuality, Reasoning, Diversity, Domain, LLM Quality
  3. Training thresholds - SFT ≥8.0, DPO chosen ≥9.0, RLVR ≥9.0
  4. IFD score - PPL(response) / PPL(response|instruction), higher = harder
  5. Security - CWE-20 (input validation), CWE-22 (path traversal)
  6. Distributed - ~1.7 ex/s with 2 machines (linear scaling)
  7. CLI commands - training_metrics module for all operations
  8. Integration - Use training_metrics library functions
  9. DPO pairs - Chosen ≥9.0, Rejected ≤6.0, gap ≥0.15
  10. RLVR - Math/coding 90%+ verifiable, general 80%+
  11. DPO scoring REQUIRED - Every pair must have chosen_score, rejected_score, margin before training
  12. Length bias audit - ≤70% of pairs where chosen is longer (prevents "longer = better" shortcut)

Source

git clone https://github.com/akaszubski/autonomous-dev
Skill file: plugins/autonomous-dev/skills/quality-scoring/SKILL.md

Overview

Quality Scoring evaluates training data with six scorers across six metrics to quantify data quality. It enables objective gating for SFT, DPO, RLVR and calibration workflows, focusing on instruction-following difficulty, factuality, reasoning, diversity, domain relevance, and overall LLM quality.

How This Skill Works

Six scorers (FastIFD, Quality, MultiDimensional, LLMQuality, Ensemble, Tulu3) produce per-sample scores across six dimensions: IFD, Factuality, Reasoning, Diversity, Domain, and LLM Quality. Scores are combined into composite metrics and required outputs for DPO (chosen_score, rejected_score, margin). The workflow relies on the training_metrics.py library with examples like calculate_ifd_score and validate_dpo_pairs to gate training data quality.

When to Use It

  • When preparing SFT, DPO, or RLVR training data and you need robust quality gating.
  • When you must gate instruction-following ability, factuality, and reasoning in your data.
  • When validating DPO pairs with multi-dimensional scores before training.
  • When calibrating datasets and handling uncertainty examples.
  • When auditing for length bias and ensuring quality isn't traded for longer responses.

Quick Start

  1. Identify the training type and select the appropriate scorers (FastIFD, Quality, MultiDimensional, LLMQuality, Ensemble, Tulu3).
  2. Compute per-sample scores across IFD, Factuality, Reasoning, Diversity, Domain, and LLM Quality using training_metrics.py (e.g., calculate_ifd_score).
  3. For DPO, ensure each pair has chosen_score, rejected_score, and margin (≥3.0); run the length-bias audit, then proceed to training if thresholds are met.

Best Practices

  • Use all six scorers and six dimensions to avoid blind spots.
  • Align scoring thresholds with the training type (SFT, DPO chosen/rejected, RLVR, Calibration).
  • Require multi-dimensional outputs for every DPO pair: chosen_score, rejected_score, and margin (≥3.0).
  • Run a length-bias audit before training to prevent 'longer = better' shortcuts.
  • Leverage training_metrics.py for IFD, DPO validation, and workflow integration.

Example Use Cases

  • Scoring a DPO data pair with chosen_score 9.2, rejected_score 5.8, and margin 3.4 to ensure a quality gap.
  • Calculating IFD Score for an instruction like 'Explain quantum computing' to assess difficulty.
  • Using RLVR-style verified solutions and multi-dimensional scoring to validate reasoning traces.
  • Calibrating datasets by checking uncertainty examples and achieving Calibration thresholds.
  • Integrating the scoring workflow into a DPO data generation pipeline (generate pairs, score, gate).
