DPO
(4 skills)
AI agent skills tagged “DPO” for Claude Code, Cursor, Windsurf, and more.
axolotl
Orchestra-Research/AI-Research-SKILLs
Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA, DPO/KTO/ORPO/GRPO, multimodal support
grpo-rl-training
Orchestra-Research/AI-Research-SKILLs
Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific model training
fine-tuning-with-trl
Orchestra-Research/AI-Research-SKILLs
Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction tuning, DPO for preference alignment, PPO/GRPO for reward optimization, and reward model training. Use when you need RLHF, want to align a model with preferences, or want to train from human feedback. Works with HuggingFace Transformers.
openrlhf-training
Orchestra-Research/AI-Research-SKILLs
High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, RLOO, and DPO training of large models (7B-70B+). Built on Ray, vLLM, and ZeRO-3. Reported to be 2× faster than DeepSpeedChat, with a distributed architecture and GPU resource sharing.