Get the FREE Ultimate OpenClaw Setup Guide →
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/lm-evaluation-harness --openclaw
Files (1)
SKILL.md
11.6 KB

lm-evaluation-harness - LLM Benchmarking

Quick start

lm-evaluation-harness evaluates LLMs across 60+ academic benchmarks using standardized prompts and metrics.

Installation:

pip install lm-eval

Evaluate any HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --device cuda:0 \
  --batch_size 8

View available tasks:

lm_eval --tasks list

Common workflows

Workflow 1: Standard benchmark evaluation

Evaluate model on core benchmarks (MMLU, GSM8K, HumanEval).

Copy this checklist:

Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model
- [ ] Step 3: Run evaluation
- [ ] Step 4: Analyze results

Step 1: Choose benchmark suite

Core reasoning benchmarks:

  • MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice
  • GSM8K - Grade school math word problems
  • HellaSwag - Common sense reasoning
  • TruthfulQA - Truthfulness and factuality
  • ARC (AI2 Reasoning Challenge) - Science questions

Code benchmarks:

  • HumanEval - Python code generation (164 problems)
  • MBPP (Mostly Basic Python Problems) - Python coding

Standard suite (recommended for model releases):

--tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge

Step 2: Configure model

HuggingFace model:

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto  # Auto-detect optimal batch size

Quantized model (4-bit/8-bit):

lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,load_in_4bit=True \
  --tasks mmlu \
  --device cuda:0

Custom checkpoint:

lm_eval --model hf \
  --model_args pretrained=/path/to/my-model,tokenizer=/path/to/tokenizer \
  --tasks mmlu \
  --device cuda:0

Step 3: Run evaluation

# Full MMLU evaluation (57 subjects)
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --num_fewshot 5 \  # 5-shot evaluation (standard)
  --batch_size 8 \
  --output_path results/ \
  --log_samples  # Save individual predictions

# Multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag,truthfulqa,arc_challenge \
  --num_fewshot 5 \
  --batch_size 8 \
  --output_path results/llama2-7b-eval.json

Step 4: Analyze results

Results saved to results/llama2-7b-eval.json:

{
  "results": {
    "mmlu": {
      "acc": 0.459,
      "acc_stderr": 0.004
    },
    "gsm8k": {
      "exact_match": 0.142,
      "exact_match_stderr": 0.006
    },
    "hellaswag": {
      "acc_norm": 0.765,
      "acc_norm_stderr": 0.004
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-2-7b-hf",
    "num_fewshot": 5
  }
}

Workflow 2: Track training progress

Evaluate checkpoints during training.

Training Progress Tracking:
- [ ] Step 1: Set up periodic evaluation
- [ ] Step 2: Choose quick benchmarks
- [ ] Step 3: Automate evaluation
- [ ] Step 4: Plot learning curves

Step 1: Set up periodic evaluation

Evaluate every N training steps:

#!/bin/bash
# eval_checkpoint.sh

CHECKPOINT_DIR=$1
STEP=$2

lm_eval --model hf \
  --model_args pretrained=$CHECKPOINT_DIR/checkpoint-$STEP \
  --tasks gsm8k,hellaswag \
  --num_fewshot 0 \  # 0-shot for speed
  --batch_size 16 \
  --output_path results/step-$STEP.json

Step 2: Choose quick benchmarks

Fast benchmarks for frequent evaluation:

  • HellaSwag: ~10 minutes on 1 GPU
  • GSM8K: ~5 minutes
  • PIQA: ~2 minutes

Avoid for frequent eval (too slow):

  • MMLU: ~2 hours (57 subjects)
  • HumanEval: Requires code execution

Step 3: Automate evaluation

Integrate with training script:

# In training loop
if step % eval_interval == 0:
    model.save_pretrained(f"checkpoints/step-{step}")

    # Run evaluation
    os.system(f"./eval_checkpoint.sh checkpoints step-{step}")

Or use PyTorch Lightning callbacks:

from pytorch_lightning import Callback

class EvalHarnessCallback(Callback):
    def on_validation_epoch_end(self, trainer, pl_module):
        step = trainer.global_step
        checkpoint_path = f"checkpoints/step-{step}"

        # Save checkpoint
        trainer.save_checkpoint(checkpoint_path)

        # Run lm-eval
        os.system(f"lm_eval --model hf --model_args pretrained={checkpoint_path} ...")

Step 4: Plot learning curves

import json
import matplotlib.pyplot as plt

# Load all results
steps = []
mmlu_scores = []

for file in sorted(glob.glob("results/step-*.json")):
    with open(file) as f:
        data = json.load(f)
        step = int(file.split("-")[1].split(".")[0])
        steps.append(step)
        mmlu_scores.append(data["results"]["mmlu"]["acc"])

# Plot
plt.plot(steps, mmlu_scores)
plt.xlabel("Training Step")
plt.ylabel("MMLU Accuracy")
plt.title("Training Progress")
plt.savefig("training_curve.png")

Workflow 3: Compare multiple models

Benchmark suite for model comparison.

Model Comparison:
- [ ] Step 1: Define model list
- [ ] Step 2: Run evaluations
- [ ] Step 3: Generate comparison table

Step 1: Define model list

# models.txt
meta-llama/Llama-2-7b-hf
meta-llama/Llama-2-13b-hf
mistralai/Mistral-7B-v0.1
microsoft/phi-2

Step 2: Run evaluations

#!/bin/bash
# eval_all_models.sh

TASKS="mmlu,gsm8k,hellaswag,truthfulqa"

while read model; do
    echo "Evaluating $model"

    # Extract model name for output file
    model_name=$(echo $model | sed 's/\//-/g')

    lm_eval --model hf \
      --model_args pretrained=$model,dtype=bfloat16 \
      --tasks $TASKS \
      --num_fewshot 5 \
      --batch_size auto \
      --output_path results/$model_name.json

done < models.txt

Step 3: Generate comparison table

import json
import pandas as pd

models = [
    "meta-llama-Llama-2-7b-hf",
    "meta-llama-Llama-2-13b-hf",
    "mistralai-Mistral-7B-v0.1",
    "microsoft-phi-2"
]

tasks = ["mmlu", "gsm8k", "hellaswag", "truthfulqa"]

results = []
for model in models:
    with open(f"results/{model}.json") as f:
        data = json.load(f)
        row = {"Model": model.replace("-", "/")}
        for task in tasks:
            # Get primary metric for each task
            metrics = data["results"][task]
            if "acc" in metrics:
                row[task.upper()] = f"{metrics['acc']:.3f}"
            elif "exact_match" in metrics:
                row[task.upper()] = f"{metrics['exact_match']:.3f}"
        results.append(row)

df = pd.DataFrame(results)
print(df.to_markdown(index=False))

Output:

| Model                  | MMLU  | GSM8K | HELLASWAG | TRUTHFULQA |
|------------------------|-------|-------|-----------|------------|
| meta-llama/Llama-2-7b  | 0.459 | 0.142 | 0.765     | 0.391      |
| meta-llama/Llama-2-13b | 0.549 | 0.287 | 0.801     | 0.430      |
| mistralai/Mistral-7B   | 0.626 | 0.395 | 0.812     | 0.428      |
| microsoft/phi-2        | 0.560 | 0.613 | 0.682     | 0.447      |

Workflow 4: Evaluate with vLLM (faster inference)

Use vLLM backend for 5-10x faster evaluation.

vLLM Evaluation:
- [ ] Step 1: Install vLLM
- [ ] Step 2: Configure vLLM backend
- [ ] Step 3: Run evaluation

Step 1: Install vLLM

pip install vllm

Step 2: Configure vLLM backend

lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto

Step 3: Run evaluation

vLLM is 5-10× faster than standard HuggingFace:

# Standard HF: ~2 hours for MMLU on 7B model
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu \
  --batch_size 8

# vLLM: ~15-20 minutes for MMLU on 7B model
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-7b-hf,tensor_parallel_size=2 \
  --tasks mmlu \
  --batch_size auto

When to use vs alternatives

Use lm-evaluation-harness when:

  • Benchmarking models for academic papers
  • Comparing model quality across standard tasks
  • Tracking training progress
  • Reporting standardized metrics (everyone uses same prompts)
  • Need reproducible evaluation

Use alternatives instead:

  • HELM (Stanford): Broader evaluation (fairness, efficiency, calibration)
  • AlpacaEval: Instruction-following evaluation with LLM judges
  • MT-Bench: Conversational multi-turn evaluation
  • Custom scripts: Domain-specific evaluation

Common issues

Issue: Evaluation too slow

Use vLLM backend:

lm_eval --model vllm \
  --model_args pretrained=model-name,tensor_parallel_size=2

Or reduce fewshot examples:

--num_fewshot 0  # Instead of 5

Or evaluate subset of MMLU:

--tasks mmlu_stem  # Only STEM subjects

Issue: Out of memory

Reduce batch size:

--batch_size 1  # Or --batch_size auto

Use quantization:

--model_args pretrained=model-name,load_in_8bit=True

Enable CPU offloading:

--model_args pretrained=model-name,device_map=auto,offload_folder=offload

Issue: Different results than reported

Check fewshot count:

--num_fewshot 5  # Most papers use 5-shot

Check exact task name:

--tasks mmlu  # Not mmlu_direct or mmlu_fewshot

Verify model and tokenizer match:

--model_args pretrained=model-name,tokenizer=same-model-name

Issue: HumanEval not executing code

Install execution dependencies:

pip install human-eval

Enable code execution:

lm_eval --model hf \
  --model_args pretrained=model-name \
  --tasks humaneval \
  --allow_code_execution  # Required for HumanEval

Advanced topics

Benchmark descriptions: See references/benchmark-guide.md for detailed description of all 60+ tasks, what they measure, and interpretation.

Custom tasks: See references/custom-tasks.md for creating domain-specific evaluation tasks.

API evaluation: See references/api-evaluation.md for evaluating OpenAI, Anthropic, and other API models.

Multi-GPU strategies: See references/distributed-eval.md for data parallel and tensor parallel evaluation.

Hardware requirements

  • GPU: NVIDIA (CUDA 11.8+), works on CPU (very slow)
  • VRAM:
    • 7B model: 16GB (bf16) or 8GB (8-bit)
    • 13B model: 28GB (bf16) or 14GB (8-bit)
    • 70B model: Requires multi-GPU or quantization
  • Time (7B model, single A100):
    • HellaSwag: 10 minutes
    • GSM8K: 5 minutes
    • MMLU (full): 2 hours
    • HumanEval: 20 minutes

Resources

Source

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/11-evaluation/lm-evaluation-harness/SKILL.mdView on GitHub

Overview

Evaluates LLMs using standardized prompts and metrics across 60+ benchmarks (e.g., MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). This makes it easy to benchmark model quality, compare models, report academic results, and track training progress. It is the industry standard used by EleutherAI, HuggingFace, and major labs, with support for HuggingFace, vLLM, and APIs.

How This Skill Works

Install the harness (pip install lm-eval), configure a model backend (e.g., HuggingFace HF models or other backends), select tasks (mmlu, gsm8k, hellaswag, etc.), run the evaluation to produce per-task metrics and a combined result JSON, and review config metadata. Results include per-task scores and a saved config block to help reproduce experiments.

When to Use It

  • Benchmark model quality on standard suites
  • Compare multiple models side-by-side in a single run
  • Publish or report academic results with per-task metrics
  • Track training progress via periodic evaluations
  • Validate performance after fine-tuning or architecture changes

Quick Start

  1. Step 1: Install lm-eval (pip install lm-eval)
  2. Step 2: Run a core benchmark on a HuggingFace model using the example: lm_eval --model hf --model_args pretrained=MODEL --tasks mmlu,gsm8k,hellaswag --device cuda:0 --batch_size 8
  3. Step 3: View and analyze results saved in a JSON file or results directory

Best Practices

  • Use the Standard suite for model releases: mmlu, gsm8k, hellaswag, truthfulqa, arc_challenge
  • Auto-detect batch size when possible and choose appropriate few-shot settings
  • Record and share the run configuration (model_args, tasks, device) for reproducibility
  • Save full results JSON and summaries to facilitate reporting and paper drafts
  • Run multiple seeds or task subsets to assess stability and variance

Example Use Cases

  • Benchmarking a HuggingFace model (e.g., Llama-2-7b-hf) across mmlu, gsm8k, hellaswag to compare with peers
  • Tracking training progress by evaluating checkpoints at regular intervals
  • Publishing a model-paper with per-task accuracies and confidence intervals
  • Generating and sharing a results JSON file like results/llama2-7b-eval.json
  • Cross-lab benchmarking to validate industry-standard metrics across models

Frequently Asked Questions

Add this skill to your agents

Related Skills

Sponsor this space

Reach thousands of developers