evaluating-code-models
BigCode Evaluation Harness - Code Model Benchmarking
Quick Start
BigCode Evaluation Harness evaluates code generation models across 15+ benchmarks including HumanEval, MBPP, and MultiPL-E (18 languages).
Installation:
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
accelerate config
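A quick way to confirm the editable install worked is to import the package (bigcode_eval, the module used below to list tasks):
python -c "import bigcode_eval"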
Evaluate on HumanEval:
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--n_samples 20 \
--batch_size 10 \
--allow_code_execution \
--save_generations
View available tasks:
python -c "from bigcode_eval.tasks import ALL_TASKS; print(ALL_TASKS)"
Common Workflows
Workflow 1: Standard Code Benchmark Evaluation
Evaluate a model on the core Python benchmarks (HumanEval, MBPP, HumanEval+).
Checklist:
Code Benchmark Evaluation:
- [ ] Step 1: Choose benchmark suite
- [ ] Step 2: Configure model and generation
- [ ] Step 3: Run evaluation with code execution
- [ ] Step 4: Analyze pass@k results
Step 1: Choose benchmark suite
Python code generation (most common):
- HumanEval: 164 handwritten problems, function completion
- HumanEval+: Same 164 problems with 80× more tests (stricter)
- MBPP: 500 crowd-sourced problems, entry-level difficulty
- MBPP+: 399 curated problems with 35× more tests
Multi-language (18 languages):
- MultiPL-E: HumanEval/MBPP translated to C++, Java, JavaScript, Go, Rust, etc.
Advanced:
- APPS: 10,000 problems (introductory/interview/competition)
- DS-1000: 1,000 data science problems across 7 libraries
Step 2: Configure model and generation
# Standard HuggingFace model
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--max_length_generation 512 \
--temperature 0.2 \
--do_sample True \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution
# Quantized model (4-bit)
accelerate launch main.py \
--model codellama/CodeLlama-34b-hf \
--tasks humaneval \
--load_in_4bit \
--max_length_generation 512 \
--allow_code_execution
# Custom/private model
accelerate launch main.py \
--model /path/to/my-code-model \
--tasks humaneval \
--trust_remote_code \
--use_auth_token \
--allow_code_execution
Step 3: Run evaluation
# Full evaluation with pass@k estimation (k=1,10,100)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks humaneval \
--temperature 0.8 \
--n_samples 200 \
--batch_size 50 \
--allow_code_execution \
--save_generations \
--metric_output_path results/starcoder2-humaneval.json
Step 4: Analyze results
Results in results/starcoder2-humaneval.json:
{
"humaneval": {
"pass@1": 0.354,
"pass@10": 0.521,
"pass@100": 0.689
},
"config": {
"model": "bigcode/starcoder2-7b",
"temperature": 0.8,
"n_samples": 200
}
}
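For reference, the pass@k numbers above come from the unbiased estimator introduced with HumanEval (Chen et al., 2021), which the harness implements; a minimal standalone sketch with made-up counts:
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per problem, c of them passed, budget k."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# made-up example: 200 samples per problem, 71 passed
print(pass_at_k(200, 71, 1))   # 0.355 (reduces to c/n for k=1)
print(pass_at_k(200, 71, 10))  # grows toward 1.0 as k increases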
Workflow 2: Multi-Language Evaluation (MultiPL-E)
Evaluate code generation across 18 programming languages.
Checklist:
Multi-Language Evaluation:
- [ ] Step 1: Generate solutions (host machine)
- [ ] Step 2: Run evaluation in Docker (safe execution)
- [ ] Step 3: Compare across languages
Step 1: Generate solutions on host
# Generate without execution (safe)
accelerate launch main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--max_length_generation 650 \
--temperature 0.8 \
--n_samples 50 \
--batch_size 50 \
--generation_only \
--save_generations \
--save_generations_path generations_multi.json
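Before moving to Docker, it is worth sanity-checking what was saved. When several tasks run together, the harness saves one generations file per task, suffixing the task name onto --save_generations_path (e.g. generations_multi_multiple-py.json); a minimal inspection sketch assuming that layout and the harness's list-of-lists format:
import json

# hypothetical per-task filename; adjust to what your run actually wrote
path = "generations_multi_multiple-py.json"
with open(path) as f:
    gens = json.load(f)  # one sub-list of n_samples completions per problem
print(f"{len(gens)} problems, {len(gens[0])} samples each")
print(gens[0][0][:200])  # peek at the first completion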
Step 2: Evaluate in Docker container
# Pull the MultiPL-E Docker image
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
# Run evaluation inside container
docker run -v $(pwd)/generations_multi.json:/app/generations.json:ro \
-it ghcr.io/bigcode-project/evaluation-harness-multiple python3 main.py \
--model bigcode/starcoder2-7b \
--tasks multiple-py,multiple-js,multiple-java,multiple-cpp \
--load_generations_path /app/generations.json \
--allow_code_execution \
--n_samples 50
Supported languages: Python, JavaScript, Java, C++, Go, Rust, TypeScript, C#, PHP, Ruby, Swift, Kotlin, Scala, Perl, Julia, Lua, R, Racket
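Step 3: Compare across languages
The evaluation writes one metrics entry per task into the metrics JSON. A short sketch to tabulate per-language pass@1, assuming the metrics landed in results.json (pass --metric_output_path to control the location) with per-task pass@1 keys as in the Workflow 1 output:
import json

with open("results.json") as f:  # metrics file from the Docker evaluation
    data = json.load(f)

# task names look like "multiple-<lang>"; pull the language suffix
for task, metrics in sorted(data.items()):
    if task.startswith("multiple-"):
        lang = task.split("-", 1)[1]
        print(f"{lang:12s} pass@1 = {metrics['pass@1']:.3f}")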
Workflow 3: Instruction-Tuned Model Evaluation
Evaluate chat/instruction models with proper formatting.
Checklist:
Instruction Model Evaluation:
- [ ] Step 1: Use instruction-tuned tasks
- [ ] Step 2: Configure instruction tokens
- [ ] Step 3: Run evaluation
Step 1: Choose instruction tasks
- instruct-humaneval: HumanEval with instruction prompts
- humanevalsynthesize-{lang}: HumanEvalPack synthesis tasks
Step 2: Configure instruction tokens
--instruction_tokens takes three comma-separated markers: the user-turn prefix, the end-of-turn token, and the assistant-turn prefix from the model's chat template.
# For models with chat templates (e.g., CodeLlama-Instruct)
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks instruct-humaneval \
--instruction_tokens "<s>[INST],</s>,[/INST]" \
--max_length_generation 512 \
--allow_code_execution
Step 3: HumanEvalPack for instruction models
# Test code synthesis across 6 languages
accelerate launch main.py \
--model codellama/CodeLlama-7b-Instruct-hf \
--tasks humanevalsynthesize-python,humanevalsynthesize-js \
--prompt instruct \
--max_length_generation 512 \
--allow_code_execution
Workflow 4: Compare Multiple Models
Run the same tasks and generation settings across several models for a fair side-by-side comparison.
Step 1: Create evaluation script
#!/bin/bash
# eval_models.sh
MODELS=(
"bigcode/starcoder2-7b"
"codellama/CodeLlama-7b-hf"
"deepseek-ai/deepseek-coder-6.7b-base"
)
TASKS="humaneval,mbpp"
for model in "${MODELS[@]}"; do
model_name=$(echo $model | tr '/' '-')
echo "Evaluating $model"
accelerate launch main.py \
--model $model \
--tasks $TASKS \
--temperature 0.2 \
--n_samples 20 \
--batch_size 20 \
--allow_code_execution \
--metric_output_path results/${model_name}.json
done
Step 2: Generate comparison table
import json
import pandas as pd
models = ["bigcode-starcoder2-7b", "codellama-CodeLlama-7b-hf", "deepseek-ai-deepseek-coder-6.7b-base"]
results = []
for model in models:
with open(f"results/{model}.json") as f:
data = json.load(f)
results.append({
"Model": model,
"HumanEval pass@1": f"{data['humaneval']['pass@1']:.3f}",
"MBPP pass@1": f"{data['mbpp']['pass@1']:.3f}"
})
df = pd.DataFrame(results)
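# note: DataFrame.to_markdown() requires the optional "tabulate" package (pip install tabulate)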
print(df.to_markdown(index=False))
When to Use vs Alternatives
Use BigCode Evaluation Harness when:
- Evaluating code generation models specifically
- Need multi-language evaluation (18 languages via MultiPL-E)
- Testing functional correctness with unit tests (pass@k)
- Benchmarking for BigCode/HuggingFace leaderboards
- Evaluating fill-in-the-middle (FIM) capabilities
Use alternatives instead:
- lm-evaluation-harness: General LLM benchmarks (MMLU, GSM8K, HellaSwag)
- EvalPlus: Stricter HumanEval+/MBPP+ with more test cases
- SWE-bench: Real-world GitHub issue resolution
- LiveCodeBench: Contamination-free, continuously updated problems
- CodeXGLUE: Code understanding tasks (clone detection, defect prediction)
Supported Benchmarks
| Benchmark | Problems | Languages | Metric | Use Case |
|---|---|---|---|---|
| HumanEval | 164 | Python | pass@k | Standard code completion |
| HumanEval+ | 164 | Python | pass@k | Stricter evaluation (80× tests) |
| MBPP | 500 | Python | pass@k | Entry-level problems |
| MBPP+ | 399 | Python | pass@k | Stricter evaluation (35× tests) |
| MultiPL-E | 164×18 | 18 languages | pass@k | Multi-language evaluation |
| APPS | 10,000 | Python | pass@k | Competition-level |
| DS-1000 | 1,000 | Python | pass@k | Data science (pandas, numpy, etc.) |
| HumanEvalPack | 164×3×6 | 6 languages | pass@k | Synthesis/fix/explain |
| Mercury | 1,889 | Python | Efficiency | Computational efficiency |
Common Issues
Issue: Different results than reported in papers
Check these factors:
# 1. Verify n_samples (need 200 for accurate pass@k)
--n_samples 200
# 2. Check temperature (0.2 for greedy-ish, 0.8 for sampling)
--temperature 0.8
# 3. Verify task name matches exactly
--tasks humaneval # Not "human_eval" or "HumanEval"
# 4. Check max_length_generation
--max_length_generation 512 # Increase for longer problems
Issue: CUDA out of memory
# Use quantization
--load_in_8bit
# OR
--load_in_4bit
# Reduce batch size
--batch_size 1
# Set memory limit
--max_memory_per_gpu "20GiB"
Issue: Code execution hangs or times out
Use Docker for safe execution:
# Generate on host (no execution)
--generation_only --save_generations
# Evaluate in Docker
docker run ... --allow_code_execution --load_generations_path ...
Issue: Low scores on instruction models
Ensure proper instruction formatting:
# Use instruction-specific tasks
--tasks instruct-humaneval
# Set instruction tokens for your model
--instruction_tokens "<s>[INST],</s>,[/INST]"
Issue: MultiPL-E language failures
Use the dedicated Docker image:
docker pull ghcr.io/bigcode-project/evaluation-harness-multiple
Command Reference
| Argument | Default | Description |
|---|---|---|
--model | - | HuggingFace model ID or local path |
--tasks | - | Comma-separated task names |
--n_samples | 1 | Samples per problem (200 for pass@k) |
--temperature | 0.2 | Sampling temperature |
--max_length_generation | 512 | Max tokens (prompt + generation) |
--batch_size | 1 | Batch size per GPU |
--allow_code_execution | False | Enable code execution (required) |
--generation_only | False | Generate without evaluation |
--load_generations_path | - | Load pre-generated solutions |
--save_generations | False | Save generated code |
--metric_output_path | results.json | Output file for metrics |
--load_in_8bit | False | 8-bit quantization |
--load_in_4bit | False | 4-bit quantization |
--trust_remote_code | False | Allow custom model code |
--precision | fp32 | Model precision (fp32/fp16/bf16) |
Hardware Requirements
| Model Size | VRAM (fp16) | VRAM (4-bit) | Time (HumanEval, n=200) |
|---|---|---|---|
| 7B | 14GB | 6GB | ~30 min (A100) |
| 13B | 26GB | 10GB | ~1 hour (A100) |
| 34B | 68GB | 20GB | ~2 hours (A100) |
Resources
- GitHub: https://github.com/bigcode-project/bigcode-evaluation-harness
- Documentation: https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs
- BigCode Leaderboard: https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard
- HumanEval Dataset: https://huggingface.co/datasets/openai/openai_humaneval
- MultiPL-E: https://github.com/nuprl/MultiPL-E
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/11-evaluation/bigcode-evaluation-harness/SKILL.md
Overview
The evaluating-code-models skill benchmarks code-generation models across 15+ benchmarks (including HumanEval, MBPP, and MultiPL-E) using pass@k metrics. It helps you compare coding ability, test multi-language support, and measure overall code-generation quality, following the BigCode standard used on HuggingFace leaderboards.
How This Skill Works
The skill relies on the BigCode Evaluation Harness to run standardized code-generation tasks, compute pass@k scores, and produce a structured results report. It runs on the standard HuggingFace stack (accelerate, transformers, datasets), covers multiple languages via MultiPL-E, and outputs per-benchmark pass@k values plus a config block for reproducibility.
When to Use It
- Benchmark a new code-generation model against established baselines on HumanEval, MBPP, and MultiPL-E.
- Test multi-language support by evaluating translated tasks in 18 languages via MultiPL-E.
- Compare models under different generation settings (max_length, temperature, n_samples) and report pass@k.
- Produce leaderboard-ready results for HuggingFace-style benchmarks.
- Assess model quality across 15+ benchmarks to identify strengths and gaps.
Quick Start
- Step 1: Install and configure the harness (clone repo, pip install -e ., and run accelerate config).
- Step 2: Run an evaluation on a task (e.g., humaneval) with your model and desired generation settings, enabling code execution if needed.
- Step 3: Review results in the generated JSON file (e.g., results/your-model-humaneval.json) and inspect pass@1, pass@10, pass@100 along with the config block.
Best Practices
- Use identical task sets and generation configurations across models for fair comparisons.
- Report pass@1, pass@10, and pass@100 to enable consistent ranking.
- Enable code execution only when safe and necessary for evaluation.
- Document environment details: library versions, hardware, and driver configurations.
- Include the raw results JSON and the exact config (model, temperature, n_samples) to ensure reproducibility.
Example Use Cases
- Rank bigcode/starcoder2-7b against CodeLlama-34b on HumanEval and MBPP.
- Compare cross-language performance with MultiPL-E across C++, Java, JavaScript, Go, and Rust.
- Measure improvements after fine-tuning with additional BigCode data and re-evaluate.
- Publish a HuggingFace-style leaderboard page for a newly developed model.
- Assess whether a model scales better with higher n_samples or lower temperature across benchmarks.
Related Skills
- phoenix-observability (Orchestra-Research/AI-Research-SKILLs): Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.
- langsmith-observability (Orchestra-Research/AI-Research-SKILLs): LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.
- nemo-evaluator-sdk (Orchestra-Research/AI-Research-SKILLs): Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
- evaluating-llms-harness (Orchestra-Research/AI-Research-SKILLs): Evaluates LLMs across 60+ academic benchmarks (MMLU, HumanEval, GSM8K, TruthfulQA, HellaSwag). Use when benchmarking model quality, comparing models, reporting academic results, or tracking training progress. Industry standard used by EleutherAI, HuggingFace, and major labs. Supports HuggingFace, vLLM, APIs.