npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/slime --openclaw

slime: LLM Post-Training Framework for RL Scaling

slime is an LLM post-training framework from Tsinghua's THUDM team, powering GLM-4.5, GLM-4.6, and GLM-4.7. It connects Megatron-LM for training with SGLang for high-throughput rollout generation.

When to Use slime

Choose slime when you need:

  • Megatron-LM native training with SGLang inference
  • Custom data generation workflows with flexible data buffers
  • Training GLM, Qwen3, DeepSeek V3, or Llama 3 models
  • Research-grade framework with production backing (Z.ai)

Consider alternatives when:

  • You need enterprise-grade stability features → use miles
  • You want flexible backend swapping → use verl
  • You need PyTorch-native abstractions → use torchforge

Key Features

  • Training: Megatron-LM with full parallelism support (TP, PP, DP, SP)
  • Rollout: SGLang-based high-throughput generation with router
  • Data Buffer: Flexible prompt management and sample storage
  • Models: GLM-4.x, Qwen3, DeepSeek V3/R1, Llama 3

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Data Buffer                          │
│ - Prompt initialization and management                  │
│ - Custom data generation and filtering                  │
│ - Rollout sample storage                                │
└─────────────┬───────────────────────────┬───────────────┘
              │                           │
┌─────────────▼───────────┐ ┌─────────────▼───────────────┐
│ Training (Megatron-LM)  │ │ Rollout (SGLang + Router)   │
│ - Actor model training  │ │ - Response generation       │
│ - Critic (optional)     │ │ - Reward/verifier output    │
│ - Weight sync to rollout│ │ - Multi-turn support        │
└─────────────────────────┘ └─────────────────────────────┘

Installation

# Recommended: Docker
docker pull slimerl/slime:latest
docker run --rm --gpus all --ipc=host --shm-size=16g \
  -it slimerl/slime:latest /bin/bash

# Inside container
cd /root/slime && pip install -e . --no-deps

From Source

git clone https://github.com/THUDM/slime.git
cd slime
pip install -r requirements.txt
pip install -e .

Quick Start: GRPO Training

# Source model configuration
source scripts/models/qwen3-4B.sh

# Launch training
python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 4 \
    --rollout-num-gpus 4 \
    --advantage-estimator grpo \
    --use-kl-loss --kl-loss-coef 0.001 \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --prompt-data /path/to/data.jsonl \
    ${MODEL_ARGS[@]} ${CKPT_ARGS[@]}

Workflow 1: Standard GRPO Training

Use this workflow for training reasoning models with group-relative advantages.
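As a rough illustration of what "group-relative" means, the sketch below normalizes each reward against the mean and standard deviation of its own group of samples for one prompt. This is the commonly described GRPO normalization, not code taken from slime; the function name and the eps value are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Hypothetical helper for illustration; slime's own estimator may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight samples for one prompt (compare --n-samples-per-prompt 8), 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# Correct responses get a positive advantage, incorrect ones a negative one.
print([round(a, 3) for a in advantages])
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason to sample several responses per prompt.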

Prerequisites Checklist

  • Docker environment or Megatron-LM + SGLang installed
  • Model checkpoint (HuggingFace or Megatron format)
  • Training data in JSONL format

Step 1: Prepare Data

# data.jsonl format
{"prompt": "What is 2 + 2?", "label": "4"}
{"prompt": "Solve: 3x = 12", "label": "x = 4"}

Or with chat format (pretty-printed here for readability; in the JSONL file each record must occupy a single line):

{
    "prompt": [
        {"role": "system", "content": "You are a math tutor."},
        {"role": "user", "content": "What is 15 + 27?"}
    ],
    "label": "42"
}
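Before launching a run it can help to check the data file against the two formats above. The following is a minimal sketch of such a check, assuming the default keys prompt and label; validate_record and validate_jsonl_lines are hypothetical helpers, not part of slime.

```python
import json


def validate_record(record, input_key="prompt", label_key="label"):
    """Return a list of problems found in one parsed JSONL record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    prompt = record.get(input_key)
    if isinstance(prompt, str):
        if not prompt.strip():
            problems.append("empty prompt string")
    elif isinstance(prompt, list):
        # Chat format: every message needs a role and a content field.
        for i, msg in enumerate(prompt):
            if not isinstance(msg, dict) or not {"role", "content"}.issubset(msg):
                problems.append(f"message {i} lacks role/content")
    else:
        problems.append(f"missing or invalid '{input_key}'")
    if label_key not in record:
        problems.append(f"missing '{label_key}'")
    return problems


def validate_jsonl_lines(lines):
    """Map 1-based line numbers to problems; valid lines are omitted."""
    report = {}
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[n] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_record(record)
        if problems:
            report[n] = problems
    return report


sample_lines = [
    '{"prompt": "What is 2 + 2?", "label": "4"}',
    '{"prompt": [{"role": "user", "content": "What is 15 + 27?"}], "label": "42"}',
    '{"prompt": "Solve: 3x = 12"}',
]
report = validate_jsonl_lines(sample_lines)
print(report)
```

For a real file, pass the open file object instead of sample_lines, since iterating over it yields one line at a time.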

Step 2: Configure Model

Choose a pre-configured model script:

# List available models
ls scripts/models/
# glm4-9B.sh, qwen3-4B.sh, qwen3-30B-A3B.sh, deepseek-v3.sh, llama3-8B.sh, ...

# Source your model
source scripts/models/qwen3-4B.sh

Step 3: Launch Training

python train.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --use-kl-loss \
    --kl-loss-coef 0.001 \
    --prompt-data /path/to/train.jsonl \
    --input-key prompt \
    --label-key label \
    --apply-chat-template \
    --rollout-batch-size 32 \
    --n-samples-per-prompt 8 \
    --global-batch-size 256 \
    --num-rollout 3000 \
    --save-interval 100 \
    --eval-interval 50 \
    ${MODEL_ARGS[@]}

Step 4: Monitor Training

  • Check TensorBoard: tensorboard --logdir outputs/
  • Verify reward curves are increasing
  • Monitor GPU utilization across nodes

Workflow 2: Asynchronous Training

Use async mode for higher throughput by overlapping rollout and training.

When to Use Async

  • Large models with long generation times
  • High GPU idle time in synchronous mode
  • Sufficient memory for buffering

Launch Async Training

python train_async.py \
    --actor-num-nodes 1 \
    --actor-num-gpus-per-node 8 \
    --rollout-num-gpus 8 \
    --advantage-estimator grpo \
    --async-buffer-size 4 \
    --prompt-data /path/to/train.jsonl \
    ${MODEL_ARGS[@]}

Async-Specific Parameters

--async-buffer-size 4        # Number of rollouts to buffer
--update-weights-interval 2  # Sync weights every N rollouts
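The idea of overlapping rollout and training through a bounded buffer can be pictured with a small producer/consumer sketch. This is a conceptual toy built on asyncio.Queue, not slime's implementation; the sleep calls stand in for generation and training, and all names are made up for the example.

```python
import asyncio


async def rollout_worker(queue, num_rollouts):
    """Producer: stands in for rollout generation."""
    for rollout_id in range(num_rollouts):
        await asyncio.sleep(0.01)    # pretend to generate samples
        await queue.put(rollout_id)  # waits while the buffer is full
    await queue.put(None)            # sentinel: no more rollouts


async def trainer(queue, trained):
    """Consumer: stands in for the training step."""
    while True:
        rollout_id = await queue.get()
        if rollout_id is None:
            break
        await asyncio.sleep(0.01)    # pretend to train on the batch
        trained.append(rollout_id)


async def run(num_rollouts=6, buffer_size=4):
    # maxsize plays the role of the buffer size: generation can run ahead
    # of training by at most buffer_size rollouts.
    queue = asyncio.Queue(maxsize=buffer_size)
    trained = []
    await asyncio.gather(rollout_worker(queue, num_rollouts), trainer(queue, trained))
    return trained


trained_order = asyncio.run(run())
print(trained_order)
```

A larger buffer lets generation run further ahead, at the cost of training on samples produced by older weights.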

Workflow 3: Multi-Turn Agentic Training

Use this workflow for training agents with tool use or multi-step reasoning.

Prerequisites

  • Custom generate function for multi-turn logic
  • Tool/environment interface

Step 1: Define Custom Generate Function

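# custom_generate.py
# generate_single, extract_tool_call, execute_tool, and compute_reward are
# placeholders for your own model call, parser, tool runtime, and reward logic.
async def custom_generate(args, samples, evaluation=False):
    """Multi-turn generation with tool calling."""
    for sample in samples:
        # Assumes a chat-format prompt (list of messages); copy it so the
        # stored prompt is not mutated by the turns appended below
        conversation = list(sample.prompt)
        response = ""  # stays empty if args.max_turns is 0

        for turn in range(args.max_turns):
            # Generate the next assistant response
            response = await generate_single(conversation)

            # Stop as soon as the model answers without calling a tool
            tool_call = extract_tool_call(response)
            if not tool_call:
                break

            # Run the tool and feed its result back into the conversation
            tool_result = execute_tool(tool_call)
            conversation.append({"role": "assistant", "content": response})
            conversation.append({"role": "tool", "content": tool_result})

        sample.response = response
        sample.reward = compute_reward(sample)

    return samples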

Step 2: Launch with Custom Function

python train.py \
    --custom-generate-function-path custom_generate.py \
    --max-turns 5 \
    --prompt-data /path/to/agent_data.jsonl \
    ${MODEL_ARGS[@]}

See examples/search-r1/ for a complete multi-turn search example.


Configuration Reference

Three Argument Categories

slime uses three types of arguments:

1. Megatron Arguments (passed directly):

--tensor-model-parallel-size 2
--pipeline-model-parallel-size 1
--num-layers 32
--hidden-size 4096

2. SGLang Arguments (prefixed with --sglang-):

--sglang-mem-fraction-static 0.8
--sglang-context-length 8192
--sglang-log-level INFO

3. slime Arguments:

# Resource allocation
--actor-num-nodes 1
--actor-num-gpus-per-node 8
--rollout-num-gpus 8
--colocate  # Share GPUs between training/inference

# Data
--prompt-data /path/to/data.jsonl
--input-key prompt
--label-key label

# Training loop
--num-rollout 3000
--rollout-batch-size 32
--n-samples-per-prompt 8
--global-batch-size 256

# Algorithm
--advantage-estimator grpo  # or: gspo, ppo, reinforce_plus_plus
--use-kl-loss
--kl-loss-coef 0.001
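The prefix convention for SGLang arguments can be illustrated with a short sketch that separates prefixed flags from everything else on a command line. It only demonstrates the naming convention; how slime itself parses and forwards arguments is not shown here, and split_sglang_args is a hypothetical helper.

```python
def split_sglang_args(argv, prefix="--sglang-"):
    """Separate flags carrying the SGLang prefix from all other tokens.

    A value token is kept with the flag that precedes it.
    """
    sglang, other = [], []
    target = other
    for token in argv:
        if token.startswith("--"):
            target = sglang if token.startswith(prefix) else other
        target.append(token)
    return sglang, other


argv = [
    "--tensor-model-parallel-size", "2",
    "--sglang-mem-fraction-static", "0.8",
    "--colocate",
    "--sglang-log-level", "INFO",
]
sglang_args, other_args = split_sglang_args(argv)
print(sglang_args)
print(other_args)
```

Megatron and slime arguments cannot be told apart by name alone, which is why the sketch only separates out the prefixed group.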

Key Constraints

rollout_batch_size × n_samples_per_prompt = global_batch_size × num_steps_per_rollout

Example: 32 × 8 = 256 × 1
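The constraint above can be checked before launching a job. The sketch below derives the number of training steps per rollout from the other three values and rejects combinations that do not divide evenly; the function name is an assumption for the example, not a slime API.

```python
def num_steps_per_rollout(rollout_batch_size, n_samples_per_prompt, global_batch_size):
    """Return how many optimizer steps one rollout feeds.

    Raises ValueError when the samples from one rollout cannot be split
    into whole global batches.
    """
    if min(rollout_batch_size, n_samples_per_prompt, global_batch_size) <= 0:
        raise ValueError("all sizes must be positive integers")
    total_samples = rollout_batch_size * n_samples_per_prompt
    if total_samples % global_batch_size != 0:
        raise ValueError(
            f"{rollout_batch_size} x {n_samples_per_prompt} = {total_samples} "
            f"is not a multiple of global batch size {global_batch_size}"
        )
    return total_samples // global_batch_size


steps = num_steps_per_rollout(32, 8, 256)   # the example above
steps_larger_rollout = num_steps_per_rollout(64, 8, 256)
print(steps, steps_larger_rollout)
```

Doubling --rollout-batch-size while keeping the other values fixed therefore means two training steps per rollout instead of one.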


Data Buffer System

slime's data buffer enables flexible data management:

Basic Data Source

class RolloutDataSource:
    def get_samples(self, num_samples):
        """Fetch prompts from dataset."""
        return self.dataset.sample(num_samples)

    def add_samples(self, samples):
        """Called after generation (no-op by default)."""
        pass

Buffered Data Source (Off-Policy)

class RolloutDataSourceWithBuffer(RolloutDataSource):
    def __init__(self):
        self.buffer = []

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def buffer_filter(self, args, buffer, num_samples):
        """Custom selection logic (prioritized, stratified, etc.)."""
        return select_best(buffer, num_samples)
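To make the buffered pattern concrete, here is a self-contained toy version that serves stored samples first and tops up with fresh prompts. ToySample, ToyBufferedSource, and top_reward_filter are hypothetical names for this sketch and do not mirror slime's actual classes; picking the highest-reward samples is just one possible selection rule.

```python
from dataclasses import dataclass


@dataclass
class ToySample:
    prompt: str
    reward: float = 0.0


def top_reward_filter(buffer, num_samples):
    """Remove and return up to num_samples highest-reward samples."""
    buffer.sort(key=lambda s: s.reward, reverse=True)
    selected = buffer[:num_samples]
    del buffer[:num_samples]
    return selected


class ToyBufferedSource:
    def __init__(self, prompts):
        self.prompts = list(prompts)  # fresh prompts not yet used
        self.buffer = []              # samples stored for reuse

    def add_samples(self, samples):
        """Store generated samples for reuse."""
        self.buffer.extend(samples)

    def get_samples(self, num_samples):
        """Serve buffered samples first, then top up with fresh prompts."""
        selected = top_reward_filter(self.buffer, num_samples)
        missing = num_samples - len(selected)
        fresh = [ToySample(p) for p in self.prompts[:missing]]
        del self.prompts[:missing]
        return selected + fresh


source = ToyBufferedSource(["p1", "p2", "p3"])
source.add_samples([ToySample("old-a", 0.2), ToySample("old-b", 0.9)])
batch = source.get_samples(3)
print([s.prompt for s in batch])
```

Swapping top_reward_filter for a different function (for example one that prefers the newest samples) changes the reuse policy without touching the rest of the class.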

Common Issues and Solutions

Issue: SGLang Engine Crash

Symptoms: Inference engine dies mid-training

Solutions:

# Enable fault tolerance
--use-fault-tolerance

# Increase memory allocation
--sglang-mem-fraction-static 0.85

# Reduce batch size
--rollout-batch-size 16

Issue: Weight Sync Timeout

Symptoms: Training hangs after rollout

Solutions:

# Increase sync interval
--update-weights-interval 5

# Use colocated mode (no network transfer)
--colocate

Issue: OOM During Training

Symptoms: CUDA OOM in backward pass

Solutions:

# Enable gradient checkpointing
--recompute-activations

# Reduce micro-batch size
--micro-batch-size 1

# Enable sequence parallelism
--sequence-parallel

Issue: Slow Data Loading

Symptoms: GPU idle during data fetch

Solutions:

# Increase data workers
--num-data-workers 4

# Use streaming dataset
--streaming-data

Supported Models

Model Family    Configurations
GLM             GLM-4.5, GLM-4.6, GLM-4.7, GLM-Z1-9B
Qwen            Qwen3 (4B, 8B, 30B-A3B), Qwen3-MoE, Qwen2.5
DeepSeek        V3, V3.1, R1
Llama           Llama 3 (8B, 70B)
Others          Kimi K2, Moonlight-16B

Each model has pre-configured scripts in scripts/models/.


Advanced Topics

Co-location Mode

Share GPUs between training and inference to reduce the number of GPUs required:

python train.py \
    --colocate \
    --actor-num-gpus-per-node 8 \
    --sglang-mem-fraction-static 0.4 \
    ${MODEL_ARGS[@]}

Custom Reward Model

# custom_rm.py
class CustomRewardModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def compute_reward(self, prompts, responses):
        inputs = self.tokenize(prompts, responses)
        scores = self.model(inputs)
        return scores.tolist()

Pass the file to the launcher:

--custom-rm-path custom_rm.py
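For data with verifiable labels, such as the math examples in Step 1, a rule-based reward is often enough and needs no learned model. The sketch below is one hedged way to score a response by exact match against the label. The function names and the "last non-empty line" convention are assumptions for illustration, and the interface slime expects from a custom reward file is not covered here.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting is ignored."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match_reward(response, label):
    """Return 1.0 when the last non-empty line of the response equals the label."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return 1.0 if normalize(lines[-1]) == normalize(label) else 0.0


reward_correct = exact_match_reward("Divide both sides by 3.\nx = 4", "x = 4")
reward_wrong = exact_match_reward("The answer is 5", "4")
print(reward_correct, reward_wrong)
```

Binary rewards like this pair naturally with group-based estimators, since the spread of outcomes within a group is what produces the learning signal.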

Multi-Task Evaluation

--eval-prompt-data aime /path/to/aime.jsonl \
--eval-prompt-data gsm8k /path/to/gsm8k.jsonl \
--n-samples-per-eval-prompt 16

Resources

Source

https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/06-post-training/slime/SKILL.md

Overview

slime is an LLM post-training framework that bridges Megatron-LM training with SGLang-based rollout generation to enable RL-based refinement for GLMs and related models. It supports custom data generation workflows, flexible data buffers, and tight Megatron-LM integration for scalable RL post-training across models like GLM-4.x, Qwen3, DeepSeek V3/R1, and Llama 3.

How This Skill Works

slime runs Megatron-LM training with full parallelism (TP, PP, DP, SP) and uses SGLang-based rollout with a router to generate responses at high throughput. A flexible Data Buffer stores prompts and rollout samples, and the training loop consumes rollout feedback (rewards and verifier outputs) to drive RL optimization.

When to Use It

  • When you need Megatron-LM native training with SGLang inference
  • When you require custom data generation workflows with flexible data buffers
  • When training GLM, Qwen3, DeepSeek V3, or Llama 3 models
  • When you want a research-grade RL post-training framework with production backing (Z.ai)
  • When you need tight Megatron-LM integration for RL scaling across large GPU clusters

Quick Start

  1. Source a model configuration, e.g. source scripts/models/qwen3-4B.sh
  2. Launch GRPO training with a sample command, e.g. the provided python train.py block including --advantage-estimator grpo and --use-kl-loss --kl-loss-coef 0.001
  3. Monitor training progress and iterate by adjusting --rollout-batch-size, --n-samples-per-prompt, --prompt-data, and other MODEL_ARGS/CKPT_ARGS as needed

Best Practices

  • Prefer a Docker-based setup or ensure Megatron-LM + SGLang are installed correctly in your environment
  • Design and manage a flexible Data Buffer for prompts and rollout samples to support varied data generation needs
  • Enable full model parallelism (TP/PP/DP/SP) and tune rollout settings with SGLang Router for throughput
  • Use representative prompts and labels in JSONL (and optional chat format) to seed training data
  • Start with the GRPO training workflow and iteratively adjust hyperparameters like --kl-loss-coef, --rollout-batch-size, and --n-samples-per-prompt

Example Use Cases

  • Post-train GLM-4.x models using GRPO with SGLang-based rollout for RL refinement
  • Implement custom data generation pipelines with flexible buffers to support domain-specific prompts
  • Scale rollout generation across GPUs using SGLang Router for high-throughput experimentation
  • Conduct RL tuning for Qwen3 or DeepSeek V3 in a research setting with production backing (Z.ai)
  • Experiment with Megatron-LM integration (TP/PP/DP/SP) to evaluate throughput and convergence in slime
