What model sizes does OpenRLHF target?

It is designed for large models in the 7B–70B+ range, using ZeRO-3 memory optimization and Ray+vLLM acceleration.

Why is it faster than DeepSpeedChat?

It distributes training across nodes with Ray, uses vLLM to accelerate rollout generation, and shares GPUs between models. Together these deliver roughly 2× speedups over DeepSpeedChat.

What dependencies are required?

Dependencies include openrlhf, ray, vllm, torch, transformers, and deepspeed; install with the vllm extras (openrlhf[vllm]) and follow the Quick Start installation steps.

openrlhf-training

Post-Training OpenRLHF RLHF PPO GRPO RLOO DPO Ray vLLM Distributed Training

npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/openrlhf --openclaw

OpenRLHF - High-Performance RLHF Training

Quick start

OpenRLHF is a Ray-based RLHF framework optimized for distributed training with vLLM inference acceleration.

Installation:

# Launch Docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN \
  -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash

# Uninstall conflicts
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y

# Install OpenRLHF with vLLM
pip install openrlhf[vllm]

PPO Training (Hybrid Engine):

ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --vllm_gpu_memory_utilization 0.5 \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
  --save_path ./output/llama3-8b-rlhf \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
  --zero_stage 3 --bf16 \
  --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 --normalize_reward \
  --gradient_checkpointing --packing_samples \
  --vllm_enable_sleep --deepspeed_enable_sleep
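
The batch-size and engine flags in the command above are related by simple arithmetic. The sketch below works through it under stated assumptions: pure data parallelism across the 8 GPUs and one sampled response per prompt. The helper names are hypothetical and are not part of OpenRLHF.

```python
def grad_accum_steps(train_batch_size: int, micro_train_batch_size: int, num_gpus: int) -> int:
    """Micro-batches each GPU accumulates before one optimizer step.

    Assumes every GPU holds one data-parallel replica and processes
    micro_train_batch_size samples per forward/backward pass.
    """
    per_pass = micro_train_batch_size * num_gpus
    if train_batch_size % per_pass != 0:
        raise ValueError("train_batch_size must be divisible by micro_train_batch_size * num_gpus")
    return train_batch_size // per_pass


def optimizer_steps_per_rollout(rollout_batch_size: int, train_batch_size: int) -> int:
    """Optimizer steps needed to consume one rollout batch once (one epoch)."""
    if rollout_batch_size % train_batch_size != 0:
        raise ValueError("rollout_batch_size must be divisible by train_batch_size")
    return rollout_batch_size // train_batch_size


def vllm_gpus(num_engines: int, tensor_parallel_size: int) -> int:
    """GPUs occupied by vLLM: each engine spans tensor_parallel_size GPUs."""
    return num_engines * tensor_parallel_size


# Values taken from the quick-start PPO command.
print(grad_accum_steps(128, 8, 8))             # 2
print(optimizer_steps_per_rollout(1024, 128))  # 8
print(vllm_gpus(4, 2))                         # 8
```

With these values, the 4 vLLM engines at tensor-parallel size 2 cover the same 8 GPUs that the colocated training models use.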

GRPO Training (Group Relative Policy Optimization):

# Same command as PPO, but add:
--advantage_estimator group_norm
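
To make the idea behind group normalization concrete, here is a minimal sketch of a group-normalized advantage for the responses sampled from a single prompt. It illustrates the general technique and is not OpenRLHF's implementation; the function name, the `std_norm` switch, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_norm_advantages(rewards, std_norm=True, eps=1e-8):
    """Advantages for one prompt's group of sampled responses.

    Each reward is centered on the group mean, so no learned critic is
    needed as a baseline. With std_norm=True the result is also divided
    by the group's standard deviation.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not std_norm:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero for identical rewards
    return [c / scale for c in centered]


# Four responses to the same prompt, two rewarded and two not.
print(group_norm_advantages([1.0, 0.0, 0.0, 1.0], std_norm=False))  # [0.5, -0.5, -0.5, 0.5]
```

Because the baseline comes from the group itself, responses are only compared against other responses to the same prompt.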

Common workflows

Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)

Step 1: Train reward model (on preference pairs):

deepspeed --module openrlhf.cli.train_rm \
  --save_path ./output/llama3-8b-rm \
  --save_steps -1 --logging_steps 1 \
  --eval_steps -1 --train_batch_size 256 \
  --micro_train_batch_size 1 --pretrain meta-llama/Meta-Llama-3-8B \
  --bf16 --max_epochs 1 --max_len 8192 \
  --zero_stage 3 --learning_rate 9e-6 \
  --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --apply_chat_template --chosen_key chosen \
  --rejected_key rejected --flash_attn --gradient_checkpointing
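
The reward model learns from chosen/rejected pairs. A common objective for this is the pairwise (Bradley-Terry) ranking loss, sketched below for a single pair. Whether `train_rm` uses exactly this form is an assumption; the function name is hypothetical.

```python
import math


def pairwise_rm_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m) = log(1 + exp(-m)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the chosen response is scored further above the rejected one.
print(pairwise_rm_loss(2.0, 0.0) < pairwise_rm_loss(0.0, 0.0) < pairwise_rm_loss(0.0, 2.0))  # True
```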

Step 2: PPO training:

ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --critic_num_nodes 1 --critic_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain ./output/llama3-8b-rm \
  --save_path ./output/llama3-8b-ppo \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --prompt_max_len 1024 --generate_max_len 1024 \
  --zero_stage 3 --bf16 \
  --actor_learning_rate 5e-7 --critic_learning_rate 9e-6 \
  --init_kl_coef 0.01 --normalize_reward \
  --vllm_enable_sleep --deepspeed_enable_sleep

Workflow 2: GRPO training (no critic model needed)

Memory-efficient alternative to PPO:

ray job submit --address="http://127.0.0.1:8265" \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  --advantage_estimator group_norm \
  --ref_num_nodes 1 --ref_num_gpus_per_node 8 \
  --reward_num_nodes 1 --reward_num_gpus_per_node 8 \
  --actor_num_nodes 1 --actor_num_gpus_per_node 8 \
  --vllm_num_engines 4 --vllm_tensor_parallel_size 2 \
  --colocate_all_models \
  --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
  --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
  --save_path ./output/llama3-8b-grpo \
  --micro_train_batch_size 8 --train_batch_size 128 \
  --micro_rollout_batch_size 16 --rollout_batch_size 1024 \
  --max_epochs 1 --bf16 \
  --actor_learning_rate 5e-7 \
  --init_kl_coef 0.01 --use_kl_loss --kl_estimator k3 \
  --normalize_reward --no_advantage_std_norm

Key GRPO parameters:

--advantage_estimator group_norm - Enables GRPO
--use_kl_loss - KL loss from GRPO paper
--kl_estimator k3 - KL estimator used for the KL loss term (k2 ≈ k1)
--no_advantage_std_norm - Disables std normalization
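
The names k1, k2, and k3 match the per-sample KL estimators commonly used in RLHF code. The sketch below assumes that naming and evaluates each estimator for one token sampled from the policy; it is illustrative and not taken from OpenRLHF's source.

```python
import math


def kl_estimate(logp_policy: float, logp_ref: float, estimator: str = "k3") -> float:
    """Per-token estimate of KL(policy || reference) for a token sampled from the policy."""
    log_ratio = logp_ref - logp_policy  # log(pi_ref / pi_theta)
    if estimator == "k1":
        return -log_ratio  # unbiased, but can be negative for a single sample
    if estimator == "k2":
        return 0.5 * log_ratio ** 2  # always non-negative, biased
    if estimator == "k3":
        return math.exp(log_ratio) - 1.0 - log_ratio  # unbiased and non-negative
    raise ValueError(f"unknown estimator: {estimator}")


# The policy assigns log-prob -1.0 to the token and the reference assigns -2.0.
for name in ("k1", "k2", "k3"):
    print(name, kl_estimate(-1.0, -2.0, name))
```

All three estimators return 0 when the policy and reference assign the same probability to the sampled token.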

Workflow 3: DPO training (preference optimization)

Simpler alternative without reward model:

deepspeed --module openrlhf.cli.train_dpo \
  --save_path ./output/llama3-8b-dpo \
  --save_steps -1 --logging_steps 1 \
  --eval_steps -1 --train_batch_size 256 \
  --micro_train_batch_size 2 --pretrain meta-llama/Meta-Llama-3-8B \
  --bf16 --max_epochs 1 --max_len 8192 \
  --zero_stage 3 --learning_rate 5e-7 --beta 0.1 \
  --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
  --apply_chat_template --chosen_key chosen \
  --rejected_key rejected --flash_attn --gradient_checkpointing
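
For reference, the standard DPO objective that `--beta` parameterizes can be sketched for a single preference pair as follows. The inputs are summed sequence log-probabilities under the policy and the frozen reference model; the function and argument names are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The policy has moved toward the chosen response relative to the reference,
# so the loss is below log(2), the value when policy and reference agree.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0) < math.log(2.0))  # True
```

The implicit reward is the policy-to-reference log-ratio scaled by beta, which is why no separate reward model is trained.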

When to use vs alternatives

Use OpenRLHF when:

Training large models (7B-70B+) with RL
Need vLLM inference acceleration
Want distributed architecture with Ray
Have multi-node GPU cluster
Need PPO/GRPO/RLOO/DPO in one framework

Algorithm selection:

PPO: Maximum control, best for complex rewards
GRPO: Memory-efficient, no critic needed
RLOO: REINFORCE Leave-One-Out, modified to use a per-token KL reward and the PPO clipped loss
REINFORCE++: More stable than GRPO, faster than PPO
DPO: Simplest, no reward model needed
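
As one point of comparison between the critic-free methods, the leave-one-out baseline that gives RLOO its name can be sketched as below. This shows only the baseline idea and ignores the KL and clipping terms; the function name is hypothetical.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k responses sampled from the same prompt.

    Each response is compared against the mean reward of the other k - 1
    responses, so its own reward never influences its baseline.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


print(rloo_advantages([3.0, 1.0]))  # [2.0, -2.0]
```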

Use alternatives instead:

TRL: Single-node training, simpler API
veRL: ByteDance's framework for 671B models
DeepSpeedChat: Integrated with DeepSpeed ecosystem

Common issues

Issue: GPU OOM with large models

Disable model colocation:

# Remove --colocate_all_models flag
# Allocate separate GPUs for each model
--actor_num_gpus_per_node 8 \
--critic_num_gpus_per_node 8 \
--reward_num_gpus_per_node 8 \
--ref_num_gpus_per_node 8
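
Removing colocation trades memory pressure for GPU count. The sketch below estimates how many GPUs a configuration needs, assuming that colocated models and vLLM engines share one pool while non-colocated ones each get disjoint GPUs. The helper is hypothetical and not an OpenRLHF API.

```python
def total_gpus(role_gpus: dict, vllm_engines: int, vllm_tp: int, colocate_all: bool) -> int:
    """Rough GPU demand for one training job.

    role_gpus maps a role name (actor, critic, reward, ref) to
    num_nodes * num_gpus_per_node for that role.
    """
    vllm = vllm_engines * vllm_tp
    if colocate_all:
        # Every model and the vLLM engines share the same GPUs.
        return max(max(role_gpus.values()), vllm)
    # Each role and the vLLM engines occupy their own GPUs.
    return sum(role_gpus.values()) + vllm


roles = {"actor": 8, "critic": 8, "reward": 8, "ref": 8}
print(total_gpus(roles, 4, 2, colocate_all=True))   # 8
print(total_gpus(roles, 4, 2, colocate_all=False))  # 40
```

Under these assumptions, the quick-start settings grow from 8 GPUs to 40 once colocation is removed, so the per-role GPU counts usually need to be reduced to fit the available cluster.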

Issue: DeepSpeed GPU index out of range

Set environment variable:

export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1

Issue: Training instability

Use Hybrid Engine instead of async:

--colocate_all_models \
--vllm_enable_sleep \
--deepspeed_enable_sleep

Adjust KL coefficient:

--init_kl_coef 0.05  # Increase from 0.01
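
To see why a larger KL coefficient can stabilize training, consider a common RLHF reward-shaping scheme: every token is penalized by the KL coefficient times its KL estimate, and the reward model score is added on the final token. The sketch below assumes that scheme rather than quoting OpenRLHF's code; the function name is hypothetical.

```python
def shaped_rewards(per_token_kl, sequence_reward, kl_coef=0.01):
    """Per-token rewards: KL penalty everywhere, reward model score on the last token."""
    if not per_token_kl:
        raise ValueError("response must contain at least one token")
    rewards = [-kl_coef * kl for kl in per_token_kl]
    rewards[-1] += sequence_reward
    return rewards


kl = [0.5, 1.0, 2.0]
# Raising kl_coef from 0.01 to 0.05 makes drift from the reference model five times costlier.
print(shaped_rewards(kl, sequence_reward=1.0, kl_coef=0.01))
print(shaped_rewards(kl, sequence_reward=1.0, kl_coef=0.05))
```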

Issue: Slow generation during PPO

Enable vLLM acceleration:

--vllm_num_engines 4 \
--vllm_tensor_parallel_size 2 \
--vllm_gpu_memory_utilization 0.5

Advanced topics

Hybrid Engine GPU sharing: See references/hybrid-engine.md for vLLM sleep mode, DeepSpeed sleep mode, and optimal node allocation.

Algorithm comparison: See references/algorithm-comparison.md for PPO vs GRPO vs RLOO vs REINFORCE++ benchmarks and hyperparameters.

Multi-node setup: See references/multi-node-training.md for Ray cluster configuration and fault tolerance.

Custom reward functions: See references/custom-rewards.md for reinforced fine-tuning and agent RLHF.

Hardware requirements

GPU: NVIDIA A100/H100 recommended
VRAM:
- 7B model: 8× A100 40GB (Hybrid Engine)
- 70B model: 48× A100 80GB (vLLM:Actor:Critic = 1:1:1)
Multi-node: Ray cluster with InfiniBand recommended
Docker: NVIDIA PyTorch container 25.02+
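
The 1:1:1 ratio in the 70B row translates into a per-component GPU count. The sketch below does that arithmetic, assuming the ratio divides the pool evenly; the helper name is hypothetical.

```python
def split_by_ratio(total_gpus: int, ratio: dict) -> dict:
    """Divide a GPU pool between components according to an integer ratio."""
    parts = sum(ratio.values())
    if total_gpus % parts != 0:
        raise ValueError("total_gpus must be divisible by the sum of the ratio")
    unit = total_gpus // parts
    return {name: share * unit for name, share in ratio.items()}


# 48 GPUs with vLLM:Actor:Critic = 1:1:1
print(split_by_ratio(48, {"vllm": 1, "actor": 1, "critic": 1}))
# {'vllm': 16, 'actor': 16, 'critic': 16}
```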

Performance:

2× faster than DeepSpeedChat
vLLM inference acceleration
Hybrid Engine minimizes GPU idle time

Resources

Docs: https://github.com/OpenRLHF/OpenRLHF
Paper: https://arxiv.org/abs/2405.11143
Examples: https://github.com/OpenRLHF/OpenRLHF/tree/main/examples
Discord: Community support

Source

View on GitHub: https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/06-post-training/openrlhf/SKILL.md

Overview

OpenRLHF is a Ray-based RLHF framework for distributed training of large language models (7B–70B+), with vLLM-accelerated generation and ZeRO-3 memory optimization. It supports PPO, GRPO, RLOO, and DPO, and delivers about 2× speedups over DeepSpeedChat through its distributed architecture and GPU resource sharing.

How This Skill Works

OpenRLHF orchestrates multi-node RLHF workflows with Ray, uses vLLM to accelerate rollout generation, and employs ZeRO-3 memory optimization to fit very large models. Users configure the roles (actor, critic, reward model, reference model) and the number of vLLM engines, while the framework coordinates data movement, gradient updates, and rollout generation across GPUs and nodes.

When to Use It

Training PPO, GRPO, RLOO, or DPO for large models (7B–70B+) where distributed training and acceleration are required.
You need higher throughput or faster turnaround than traditional RLHF stacks, leveraging Ray+vLLM to speed up training and inference.
Running end-to-end RLHF pipelines (SFT → reward model → PPO) with multi-node GPU resources and ZeRO-3 memory optimization.
Memory-constrained setups where GRPO (Group Relative Policy Optimization) is a more memory-efficient alternative to standard PPO because it needs no critic model.
Scenarios requiring multi-node orchestration, GPU resource sharing, and colocated models for reduced inter-node communication overhead.

Quick Start

Step 1: Install OpenRLHF with vLLM in a compatible environment. Example: docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:25.02-py3 bash; sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y; pip install openrlhf[vllm]
Step 2: Launch a Ray cluster and run PPO training. Example: ray start --head --node-ip-address 0.0.0.0 --num-gpus 8; ray job submit --address="http://127.0.0.1:8265" -- python3 -m openrlhf.cli.train_ppo_ray [options from doc]
Step 3: To try GRPO, use the same PPO command and append --advantage_estimator group_norm; for other workflows, follow the end-to-end steps in the docs.

Best Practices

Use a true multi-node Ray cluster with a colocated setup for actor, reward model, and critic components when possible.
Tune vLLM settings (vllm_num_engines, vllm_tensor_parallel_size, vllm_gpu_memory_utilization) to balance memory and throughput.
Enable memory-saving techniques for large models: gradient_checkpointing, packing_samples, and bf16 precision.
Maintain clean environments: follow the install steps (avoid conflicting packages) and install via openrlhf[vllm].
Enable vLLM sleep and DeepSpeed sleep (--vllm_enable_sleep, --deepspeed_enable_sleep) when models are colocated, so that GPUs are shared between generation and training instead of sitting idle.

Example Use Cases

PPO Training (Hybrid Engine) on 8 GPUs: use 4 vLLM engines with tensor_parallel_size=2, colocate all models, and enable bf16 and DeepSpeed sleep to train OpenRLHF/Llama-3-8b-sft-mixture against the OpenRLHF/Llama-3-8b-rm-700k reward model.
GRPO Training (no critic model needed): the same PPO command with --advantage_estimator group_norm, a memory-efficient alternative to PPO for large-scale RLHF workflows.
Train Reward Model (on preference pairs): deepspeed --module openrlhf.cli.train_rm --save_path ./output/llama3-8b-rm --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku --gradient_checkpointing --flash_attn.
Full RLHF Pipeline (SFT → Reward Model → PPO): train the reward model on preference data, then run PPO with vLLM engines on 8 GPUs, using that reward model to score rollouts and optimize the policy.