
NeMo Evaluator SDK - Enterprise LLM Benchmarking

Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

Installation:

pip install nemo-evaluator-launcher

Set API key and run evaluation:

export NGC_API_KEY=nvapi-your-key-here

# Create minimal config
cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

# Run evaluation
nemo-evaluator-launcher run --config-dir . --config-name config

View available tasks:

nemo-evaluator-launcher ls tasks

Common Workflows

Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

Checklist:

Standard Evaluation:
- [ ] Step 1: Configure API endpoint
- [ ] Step 2: Select benchmarks
- [ ] Step 3: Run evaluation
- [ ] Step 4: Check results

Step 1: Configure API endpoint

# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

For self-hosted endpoints (vLLM, TRT-LLM):

target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local

Step 2: Select benchmarks

Add tasks to your config:

evaluation:
  tasks:
    - name: ifeval           # Instruction following
    - name: gpqa_diamond     # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN   # Some tasks need HF token
    - name: gsm8k_cot_instruct  # Math reasoning
    - name: humaneval        # Code generation

Step 3: Run evaluation

# Run with config file
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

# Override output directory
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

# Limit samples for quick testing
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10

Step 4: Check results

# Check job status
nemo-evaluator-launcher status <invocation_id>

# List all runs
nemo-evaluator-launcher ls runs

# View results
cat results/<invocation_id>/<task>/artifacts/results.yml

Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

Checklist:

Slurm Evaluation:
- [ ] Step 1: Configure Slurm settings
- [ ] Step 2: Set up model deployment
- [ ] Step 3: Launch evaluation
- [ ] Step 4: Monitor job status

Step 1: Configure Slurm settings

# slurm_config.yaml
defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

Step 2: Set up model deployment

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment

Step 3: Launch evaluation

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config

Step 4: Monitor job status

# Check status (queries sacct)
nemo-evaluator-launcher status <invocation_id>

# View detailed info
nemo-evaluator-launcher info <invocation_id>

# Kill if needed
nemo-evaluator-launcher kill <invocation_id>

Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

Checklist:

Model Comparison:
- [ ] Step 1: Create base config
- [ ] Step 2: Run evaluations with overrides
- [ ] Step 3: Export and compare results

Step 1: Create base config

# base_eval.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval

Step 2: Run evaluations with model overrides

# Evaluate Llama 3.1 8B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

# Evaluate Mistral 7B
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

Step 3: Export and compare

# Export to MLflow
nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

# Export to local JSON
nemo-evaluator-launcher export <invocation_id> --dest local --format json

# Export to Weights & Biases
nemo-evaluator-launcher export <invocation_id> --dest wandb

Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

Checklist:

Safety/VLM Evaluation:
- [ ] Step 1: Configure safety tasks
- [ ] Step 2: Set up VLM tasks (if applicable)
- [ ] Step 3: Run evaluation

Step 1: Configure safety tasks

evaluation:
  tasks:
    - name: aegis              # Safety harness
    - name: wildguard          # Safety classification
    - name: garak              # Security probing

Step 2: Configure VLM tasks

# For vision-language models
target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench           # OCR evaluation
    - name: chartqa            # Chart understanding
    - name: mmmu               # Multimodal understanding

When to Use vs Alternatives

Use NeMo Evaluator when:

  • You need 100+ benchmarks from 18+ harnesses in one platform
  • You run evaluations on Slurm HPC clusters or in the cloud
  • You require reproducible, containerized evaluation
  • You evaluate against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)
  • You need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:

  • lm-evaluation-harness: Simpler setup for quick local evaluation
  • bigcode-evaluation-harness: Focused only on code benchmarks
  • HELM: Stanford's broader evaluation (fairness, efficiency)
  • Custom scripts: Highly specialized domain evaluation

Supported Harnesses and Tasks

| Harness | Task Count | Categories |
|---|---|---|
| lm-evaluation-harness | 60+ | MMLU, GSM8K, HellaSwag, ARC |
| simple-evals | 20+ | GPQA, MATH, AIME |
| bigcode-evaluation-harness | 25+ | HumanEval, MBPP, MultiPL-E |
| safety-harness | 3 | Aegis, WildGuard |
| garak | 1 | Security probing |
| vlmevalkit | 6+ | OCRBench, ChartQA, MMMU |
| bfcl | 6 | Function calling v2/v3 |
| mtbench | 2 | Multi-turn conversation |
| livecodebench | 10+ | Live coding evaluation |
| helm | 15 | Medical domain |
| nemo-skills | 8 | Math, science, agentic |
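
To check which of these tasks are actually exposed by your installed launcher version, you can filter the task listing. This is a minimal sketch; it assumes the ls tasks output includes harness and task names as plain text, which may vary between versions:

# List everything, then narrow to one harness or one benchmark
nemo-evaluator-launcher ls tasks
nemo-evaluator-launcher ls tasks | grep -i bigcode     # tasks from a specific harness
nemo-evaluator-launcher ls tasks | grep -i humaneval   # a specific benchmark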

Common Issues

Issue: Container pull fails

Ensure NGC credentials are configured:

docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY

Issue: Task requires environment variable

Some tasks need HF_TOKEN or JUDGE_API_KEY:

evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Passes your local HF_TOKEN env var through to the task

Issue: Evaluation timeout

Increase parallelism or reduce samples:

-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100

Issue: Slurm job not starting

Check Slurm account and partition:

execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS

Issue: Different results than expected

Verify configuration matches reported settings:

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count

CLI Reference

| Command | Description |
|---|---|
| run | Execute evaluation with config |
| status <id> | Check job status |
| info <id> | View detailed job info |
| ls tasks | List available benchmarks |
| ls runs | List all invocations |
| export <id> | Export results (mlflow/wandb/local) |
| kill <id> | Terminate running job |
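
A typical lifecycle ties these commands together. The sketch below assumes the run command prints an invocation id that you reuse in the follow-up commands (the exact output format may differ by version):

# Launch, monitor, and export a run end to end
nemo-evaluator-launcher run --config-dir . --config-name config
# Note the <invocation_id> printed by the run command, then:
nemo-evaluator-launcher status <invocation_id>
nemo-evaluator-launcher info <invocation_id>
nemo-evaluator-launcher export <invocation_id> --dest local --format json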

Configuration Override Examples

# Override model endpoint
-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

# Add evaluation parameters
-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

# Change execution settings
-o execution.output_dir=/custom/path
-o execution.mode=parallel

# Dynamically set tasks
-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'

Python API Usage

For programmatic evaluation without the CLI:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams
)

# Configure evaluation
eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4
    )
)

# Configure target endpoint
target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here"
    )
)

# Run evaluation
result = evaluate(eval_cfg=eval_config, target_cfg=target_config)

Advanced Topics

  • Multi-backend execution: see references/execution-backends.md
  • Configuration deep-dive: see references/configuration.md
  • Adapter and interceptor system: see references/adapter-system.md
  • Custom benchmark integration: see references/custom-benchmarks.md

Requirements

  • Python: 3.10-3.13
  • Docker: Required for local execution
  • NGC API Key: For pulling containers and using NVIDIA Build
  • HF_TOKEN: Required for some benchmarks (GPQA, MMLU)
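
A quick way to sanity-check these requirements before a first run is sketched below; it assumes a POSIX shell and only uses the credentials already described above:

# Verify interpreter, Docker, and credentials are in place
python --version                                            # expect 3.10-3.13
docker info > /dev/null && echo "Docker OK"
test -n "$NGC_API_KEY" && echo "NGC_API_KEY set"
test -n "$HF_TOKEN" && echo "HF_TOKEN set (needed for gated benchmarks)"
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"     # confirms NGC container access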

Resources

Source

View on GitHub: https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/11-evaluation/nemo-evaluator/SKILL.md

Overview

NeMo Evaluator SDK is an enterprise-grade benchmarking platform that evaluates LLMs across 100+ benchmarks from 18+ harnesses (e.g., MMLU, HumanEval, GSM8K, safety, VLM). It enables scalable, reproducible benchmarking on local Docker, Slurm HPC, or cloud environments with a container-first architecture.

How This Skill Works

The SDK uses containerized evaluation driven by nemo-evaluator-launcher. You provide a config.yaml that specifies execution (local, slurm, etc.), deployment, and a target API endpoint; running nemo-evaluator-launcher run executes the chosen benchmarks across backends, producing results in a defined output directory.

When to Use It

  • Need scalable evaluation across 100+ benchmarks for model comparison.
  • Work entirely within local Docker environments for reproducible results.
  • Run large-scale evaluations on Slurm HPC clusters.
  • Deploy and evaluate against cloud endpoints or private cloud (e.g., Lepton).
  • Require enterprise-grade, container-first benchmarking with reproducible workflows.

Quick Start

  1. Install the launcher: pip install nemo-evaluator-launcher
  2. Set your API key and create a minimal config.yaml with execution, deployment, and target.api_endpoint details
  3. Run the evaluation: nemo-evaluator-launcher run --config-dir . --config-name config

Best Practices

  • Use containerized evaluation to ensure reproducibility across environments.
  • Start with a minimal config (defaults) and clearly set output_dir for results.
  • Store sensitive tokens (e.g., HF_TOKEN) as environment variables when needed.
  • Validate the API endpoint configuration before launching runs (see the sketch after this list).
  • Monitor progress with nemo-evaluator-launcher status and review results in output_dir.
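
One way to validate the endpoint before launching is a single manual request. The sketch below assumes an OpenAI-compatible chat completions API and reuses the NVIDIA Build endpoint and model id from the examples above; adjust both for your own target:

# Smoke-test the target endpoint with a single chat completion
curl -s https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NGC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta/llama-3.1-8b-instruct", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'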

Example Use Cases

  • A research team benchmarks a new model across MMLU, GSM8K, and HumanEval using local Docker for rapid iteration.
  • An enterprise runs 100+ benchmarks on a Slurm HPC cluster to compare multiple model variants.
  • A team self-hosts endpoints with vLLM or TRT-LLM and evaluates privately without external calls.
  • Cloud-scale benchmarking using Lepton cloud to distribute tasks across multiple regions.
  • Reproducible benchmarking for regulatory or auditing needs with container-first deployments.


Related Skills

  • phoenix-observability (Orchestra-Research/AI-Research-SKILLs): Open-source AI observability platform for LLM tracing, evaluation, and monitoring. Use when debugging LLM applications with detailed traces, running evaluations on datasets, or monitoring production AI systems with real-time insights.

  • evaluating-code-models (Orchestra-Research/AI-Research-SKILLs): Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from the BigCode Project used by HuggingFace leaderboards.

  • langsmith-observability (Orchestra-Research/AI-Research-SKILLs): LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.

  • nemo-guardrails (Orchestra-Research/AI-Research-SKILLs): NVIDIA's runtime safety framework for LLM applications. Features jailbreak detection, input/output validation, fact-checking, hallucination detection, PII filtering, and toxicity detection. Uses the Colang 2.0 DSL for programmable rails. Production-ready; runs on a T4 GPU.

  • nemo-curator (Orchestra-Research/AI-Research-SKILLs): GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Features fuzzy deduplication (16× faster), quality filtering (30+ heuristics), semantic deduplication, PII redaction, and NSFW detection. Scales across GPUs with RAPIDS. Use for preparing high-quality training datasets, cleaning web data, or deduplicating large corpora.

  • tensorrt-llm (Orchestra-Research/AI-Research-SKILLs): Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
