miles: Enterprise-Grade RL for Large-Scale Model Training

miles is a high-performance, enterprise-ready RL framework optimized for large-scale model post-training. Built as a production fork of slime, it addresses critical challenges in MoE training stability, low-precision training, and train-inference alignment.

When to Use miles

Choose miles when you need:

  • Training 1TB+ MoE models (DeepSeek V3, Qwen3-MoE)
  • FP8 or INT4 quantization-aware training
  • Bit-wise identical train-inference alignment
  • Speculative RL for maximum throughput
  • Production stability with enterprise support

Consider alternatives when:

  • You want the research-grade original → use slime
  • You need flexible backend swapping → use verl
  • You want PyTorch-native abstractions → use torchforge

Key Features

Low-Precision Training

  • Unified FP8: End-to-end FP8 for both inference and training
  • INT4 QAT: fits 1TB-class models in single-machine VRAM (H200)
  • Rollout Routing Replay (R3): Bit-wise expert alignment for MoE

Performance Optimizations

  • Speculative RL: 25%+ rollout speedup with online SFT draft models
  • Zero-Copy Weight Sync: CUDA IPC zero-copy mapping
  • Partial Rollout: Recycle half-finished trajectories

Train-Inference Alignment

  • TIS/MIS: Truncated/Masked Importance Sampling for off-policy correction (a minimal sketch follows after this list)
  • Kernel-level optimization: FlashAttention-3, DeepGEMM integration
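
A minimal sketch of what truncated/masked importance sampling computes, assuming per-token log probs from both the rollout engine and the training model. The function names and the exact truncation rule are illustrative, not the miles API:

import torch

# Illustrative TIS: cap the per-token importance ratio so stale rollout
# samples cannot dominate the update. How miles interprets --tis-threshold
# may differ from this sketch.
def tis_weights(train_log_probs: torch.Tensor,
                rollout_log_probs: torch.Tensor,
                cap: float = 2.0) -> torch.Tensor:
    ratio = torch.exp(train_log_probs - rollout_log_probs)
    return torch.clamp(ratio, max=cap)  # truncate, keep gradient signal bounded

# Illustrative MIS: instead of clamping, drop (mask) tokens whose ratio
# leaves a trust band around 1.0.
def mis_mask(train_log_probs: torch.Tensor,
             rollout_log_probs: torch.Tensor,
             low: float = 0.5, high: float = 2.0) -> torch.Tensor:
    ratio = torch.exp(train_log_probs - rollout_log_probs)
    return ((ratio > low) & (ratio < high)).float()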

Installation

# Recommended: Docker
docker pull radixark/miles:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it radixark/miles:latest /bin/bash

# From source
git clone https://github.com/radixark/miles.git
cd miles
pip install -r requirements.txt
pip install -e .

Quick Start

miles inherits slime's configuration system. Basic training:

python train.py \
    --advantage-estimator grpo \
    --model-name qwen3-30b-a3b \
    --hf-checkpoint /path/to/qwen3-30b-a3b-hf \
    --rollout-batch-size 512 \
    --n-samples-per-prompt 8

Workflow 1: Large MoE Training

Use this workflow for training large MoE models like DeepSeek V3 or Qwen3-MoE.

Prerequisites Checklist

  • H100/H200 GPUs with FP8 support
  • MoE model (DeepSeek V3, Qwen3-MoE)
  • Docker environment with miles

Step 1: Environment Setup

# FP8 block scaling (recommended for stability)
export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
export CUDA_DEVICE_MAX_CONNECTIONS=1

Step 2: Configure Training

python train.py \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --hf-checkpoint /path/to/deepseek-v3 \
    --advantage-estimator grpo \
    --tensor-model-parallel-size 8 \
    --expert-model-parallel-size 4 \
    --prompt-data /path/to/data.jsonl \
    --num-rollout 3000

Verification Checklist

  • Model loads without errors
  • Routing decisions are consistent
  • No NaN/Inf in loss values (a small guard sketch follows)
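
To automate the last check, a tiny guard you could drop into a custom training loop; this is illustrative, not a hook miles necessarily exposes:

import torch

def assert_finite(name: str, t: torch.Tensor) -> None:
    # Fail fast the moment a NaN/Inf appears instead of training through it.
    if not torch.isfinite(t).all():
        raise FloatingPointError(f"non-finite values in {name}")

# usage inside a training step:
# assert_finite("loss", loss)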

Workflow 2: Speculative RL Training

Use this workflow for maximum rollout throughput with EAGLE speculative decoding.

How Speculative RL Works

  1. Small draft model generates candidate tokens
  2. Target model verifies in parallel
  3. Draft model updated via online SFT to track the policy (a toy sketch of the propose/verify loop follows)
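
To make steps 1-2 concrete, here is a toy greedy propose/verify loop. The models are stand-in callables returning per-position logits, not the miles/SGLang API, and real EAGLE uses rejection sampling to preserve the target distribution rather than greedy matching:

import torch

def speculative_step(draft_model, target_model, prefix: list[int], k: int = 4):
    # 1. Draft model proposes k candidate tokens autoregressively (cheap).
    ctx, draft_tokens = list(prefix), []
    for _ in range(k):
        tok = int(torch.argmax(draft_model(ctx)[-1]))
        draft_tokens.append(tok)
        ctx.append(tok)
    # 2. Target model scores the prefix plus ALL candidates in one forward pass.
    target_logits = target_model(prefix + draft_tokens)
    # 3. Accept the longest run of candidates the target agrees with; the
    #    whole run costs a single target forward pass.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if int(torch.argmax(target_logits[len(prefix) + i - 1])) != tok:
            break
        accepted.append(tok)
    return accepted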

Step 1: Enable Speculative Decoding

miles supports EAGLE speculative decoding via SGLang:

python train.py \
    --actor-num-gpus-per-node 8 \
    --hf-checkpoint /path/to/target-model \
    --sglang-speculative-algorithm EAGLE \
    --sglang-speculative-num-steps 3 \
    --sglang-speculative-eagle-topk 1 \
    --sglang-speculative-num-draft-tokens 4 \
    --sglang-speculative-draft-model-path /path/to/draft-model \
    --advantage-estimator grpo \
    --prompt-data /path/to/data.jsonl

Step 2: Enable Online MTP Training (Optional)

For online SFT of draft model during training:

--mtp-num-layers 1 \
--enable-mtp-training \
--mtp-loss-scaling-factor 0.2

Note: Online MTP training requires a torch dist checkpoint with MTP weights. Add --mtp-num-layers 1 during checkpoint conversion from HuggingFace.

Expected Speedup

  • Standard rollout: Baseline
  • Speculative RL: 25-40% faster rollout
  • With partial rollout: Additional 10-15% throughput

Configuration Reference

miles inherits all slime arguments. See slime API Reference for the complete list.

Cluster Resources (from slime)

--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--rollout-num-gpus-per-engine 2
--colocate

Megatron Parallelism (from slime)

--tensor-model-parallel-size 8
--pipeline-model-parallel-size 2
--expert-model-parallel-size 4    # MoE expert parallelism

Speculative Decoding (miles-specific)

--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
--sglang-enable-draft-weights-cpu-backup
--sglang-speculative-draft-model-path /your/draft/model/path

Online MTP Training (miles-specific)

--mtp-num-layers 1
--enable-mtp-training
--mtp-loss-scaling-factor 0.2

Key Features (Conceptual)

The following features are documented in miles, but specific CLI flags may vary. Consult the miles repository for the latest configuration.

Unified FP8 Pipeline

End-to-end FP8 sampling and training that eliminates the quantization-induced train-inference discrepancy that causes RL collapse in MoE models.

Rollout Routing Replay (R3)

Records expert routing decisions during SGLang inference and replays them during Megatron training for bit-wise expert alignment.

How R3 Works:

  1. During SGLang inference, expert routing decisions are recorded
  2. Routing decisions stored in sample.rollout_routed_experts
  3. During Megatron training, routing is replayed instead of recomputed
  4. Ensures identical expert selection between train and inference (see the sketch below)
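
An illustrative record-and-replay router for a single MoE layer. miles carries the record through sample.rollout_routed_experts; the class and attribute names here are assumptions for the sketch:

import torch

class ReplayableRouter(torch.nn.Module):
    """Top-k gate that can replay expert ids recorded at rollout time."""
    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k
        self.recorded = None   # filled during inference (source of R3 data)
        self.replay = None     # set before the training forward pass

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        if self.replay is not None:
            # Training: reuse the expert ids chosen at rollout time
            # instead of recomputing the (numerically fragile) gate.
            return self.replay
        expert_ids = self.gate(hidden).topk(self.top_k, dim=-1).indices
        self.recorded = expert_ids  # stash for rollout_routed_experts
        return expert_ids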

INT4 Quantization-Aware Training

Enables single-machine deployment of 1TB+ models (e.g., on H200).

Memory Savings with INT4:

| Model Size | BF16 VRAM | INT4 VRAM | Reduction |
|------------|-----------|-----------|-----------|
| 70B        | 140 GB    | 45 GB     | 3.1x      |
| 235B       | 470 GB    | 150 GB    | 3.1x      |
| 671B       | 1.3 TB    | 420 GB    | 3.1x      |
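
These figures are roughly bytes-per-parameter arithmetic: 2 bytes per param for BF16 weights versus ~0.62 bytes effective for INT4 (4-bit weights plus group-wise scale factors); activations, optimizer state, and KV cache come on top. A quick back-of-the-envelope check, with the 0.62 figure being an assumption inferred from the table:

def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    # 1e9 params x bytes/param ~= GB of weight memory (weights only)
    return params_billions * bytes_per_param

print(weight_gb(671, 2.0))    # ~1342 GB in BF16 (table: 1.3 TB)
print(weight_gb(671, 0.62))   # ~416 GB in INT4 + scales (table: 420 GB)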

Train-Inference Alignment

miles achieves "exactly 0 KL divergence" between training and inference through:

  • Flash Attention 3
  • DeepGEMM
  • Batch-invariant kernels from Thinking Machines Lab (probed in the sketch after this list)
  • torch.compile integration
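
A quick probe of the batch-invariance property: the same sequence should produce bit-identical logits whether it runs alone or inside a larger batch. This is an illustrative diagnostic, assuming `model` is any callable that maps a batch of token tensors to per-token logits:

import torch

@torch.no_grad()
def batch_invariant(model, seq: torch.Tensor, filler: torch.Tensor) -> bool:
    # `filler` is any other sequence of the same length used to pad the batch.
    solo = model(seq.unsqueeze(0))[0]               # batch of 1
    batched = model(torch.stack([seq, filler]))[0]  # same seq, batch of 2
    return torch.equal(solo, batched)               # bit-wise, not allclose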

Sample Data Structure

miles uses the same Sample dataclass as slime with the rollout_routed_experts field for MoE routing replay:

from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str | list[dict]        # raw prompt text or chat messages
    tokens: list[int]               # prompt + response token ids
    response: str
    reward: float | dict
    loss_mask: list[int]            # 1 where a token contributes to the loss
    status: Status                  # rollout status enum (defined in slime)
    metadata: dict
    rollout_log_probs: list[float]  # per-token log probs from the rollout engine
    rollout_routed_experts: list[list[int]]  # MoE routing for R3

See slime API Reference for the complete Sample definition.


Common Issues and Solutions

Issue: FP8 Training Collapse

Symptoms: Loss explodes, NaN values

Solutions:

  • Use block scaling: export NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1
  • Reduce learning rate: --lr 5e-7
  • Ensure MoE routing is consistent between train/inference

Issue: Speculative Draft Drift

Symptoms: Low acceptance rate over time

Solutions:

  • Enable online MTP training to keep draft model aligned
  • Reduce speculative steps: --sglang-speculative-num-steps 2
  • Use CPU backup: --sglang-enable-draft-weights-cpu-backup

Issue: Train-Inference Mismatch

Symptoms: Policy divergence, reward collapse

Solutions:

  • Use TIS for off-policy correction: --use-tis --tis-threshold 0.9
  • Verify log probs match between SGLang and Megatron (a quick diagnostic sketch follows)
  • Enable R3 for MoE models
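
For the log-prob check, a quick diagnostic comparing the log probs the rollout engine returned (sample.rollout_log_probs) against a recompute with the training model. The function name is a placeholder, not a miles utility:

import torch

def logprob_gap(rollout_logp: list[float], train_logp: list[float],
                atol: float = 1e-3) -> bool:
    gap = (torch.tensor(rollout_logp) - torch.tensor(train_logp)).abs()
    print(f"max gap {gap.max().item():.2e}, mean gap {gap.mean().item():.2e}")
    # With bit-wise train-inference alignment the gap should be exactly 0.
    return bool((gap < atol).all())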

Supported Models

| Family   | Models                    | MoE Support |
|----------|---------------------------|-------------|
| DeepSeek | R1, V3, V3.2              | Full        |
| Qwen     | 2, 2.5, 3 (including MoE) | Full        |
| Llama    | 3, 3.1, 3.3, 4            | Dense only  |
| Gemma    | 2, 3, 3N                  | Dense only  |
| GLM      | 4.5, 4.6, 4.7             | Dense only  |
| MiniMax  | M2, M2.1                  | Full        |

Resources

Source

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs.git
# Skill file: 06-post-training/miles/SKILL.md

Overview

miles is an enterprise-ready RL framework built as a production fork of slime, optimized for large-scale model post-training. It addresses MoE training stability, low-precision training, and train-inference alignment, and supports speculative RL for maximum throughput.

How This Skill Works

miles implements end-to-end FP8 training and INT4 quantization-aware training, with Rollout Routing Replay (R3) for MoE alignment. It includes train-inference alignment using TIS/MIS and kernel-level optimizations like FlashAttention-3 and DeepGEMM, plus zero-copy weight syncing for high throughput. It also supports speculative RL workflows (EAGLE) to increase rollout speed.

When to Use It

  • Training 1TB+ MoE models (e.g., DeepSeek V3, Qwen3-MoE)
  • FP8 or INT4 quantization-aware training
  • Bit-wise identical train-inference alignment
  • Speculative RL for maximum throughput
  • Production stability with enterprise support

Quick Start

  1. Step 1: Install miles (Docker or from source)
  2. Step 2: Run a basic training job with the example command: python train.py --advantage-estimator grpo --model-name qwen3-30b-a3b --hf-checkpoint /path/to/qwen3-30b-a3b-hf --rollout-batch-size 512 --n-samples-per-prompt 8
  3. Step 3: Verify outputs and adjust rollout configs and FP8/INT4 settings

Best Practices

  • Ensure GPUs with FP8 support (H100/H200) are available
  • Leverage miles workflows for large MoE training and verify model routing decisions
  • Utilize Zero-Copy Weight Sync for fast inter-process communication
  • Enable Speculative RL with EAGLE carefully and monitor online SFT drift
  • Validate train-inference alignment with TIS/MIS and kernel optimizations (FlashAttention-3, DeepGEMM)

Example Use Cases

  • Training DeepSeek V3 MoE with FP8 on miles
  • Training Qwen3-MoE at ~1TB scale on H200 with INT4 QAT
  • Speculative RL using EAGLE for higher rollout throughput
  • Train-inference alignment validated with TIS/MIS in production
  • Enterprise deployment of miles for long-running RL workloads

Add this skill to your agents

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/miles --openclaw

Related Skills

sglang

Orchestra-Research/AI-Research-SKILLs

Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need 5× faster inference than vLLM with prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.

deepspeed

Orchestra-Research/AI-Research-SKILLs

Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, sparse attention

optimizing-attention-flash

Orchestra-Research/AI-Research-SKILLs

Optimizes transformer attention with Flash Attention for 2-4x speedup and 10-20x memory reduction. Use when training/running transformers with long sequences (>512 tokens), encountering GPU memory issues with attention, or need faster inference. Supports PyTorch native SDPA, flash-attn library, H100 FP8, and sliding window attention.

grpo-rl-training

Orchestra-Research/AI-Research-SKILLs

Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training

moe-training

Orchestra-Research/AI-Research-SKILLs

Train Mixture of Experts (MoE) models using DeepSpeed or HuggingFace. Use when training large-scale models with limited compute (5× cost reduction vs dense models), implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or scaling model capacity without proportional compute increase. Covers MoE architectures, routing mechanisms, load balancing, expert parallelism, and inference optimization.

fine-tuning-with-trl

Orchestra-Research/AI-Research-SKILLs

Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when need RLHF, align model with preferences, or train from human feedback. Works with HuggingFace Transformers.
