ReMax: A Tailored Algorithm for Efficient RLHF in LLMs
Introduction
Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs) and is typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general Reinforcement Learning (RL) tasks, it is unnecessarily complex for LLM fine-tuning, incurring substantial memory and computation costs. To make RLHF more efficient, we present a tailored algorithm called ReMax.
New Techniques in ReMax
ReMax leverages three properties of RLHF that PPO does not fully exploit: fast simulation, deterministic transitions, and trajectory-level rewards. Building on the classic REINFORCE algorithm, ReMax streamlines the RLHF process through the following enhancements (a code sketch follows this list):
1. No additional value model: Unlike PPO, ReMax does not train a separate value model, cutting both memory use and computational cost.
2. Tailored variance reduction: ReMax introduces a new variance-reduction technique that keeps the REINFORCE-style gradient estimate stable without a learned value function.
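To make the update concrete, here is a minimal PyTorch-style sketch of a ReMax-like training step under our reading of the method: a REINFORCE gradient on the sampled response, with the reward of a greedily decoded response used as the variance-reducing baseline. The names `policy`, `reward_model`, and `sequence_log_prob` are illustrative placeholders rather than the authors' implementation, and generation details (lengths, padding, KL regularization) are omitted.

```python
import torch

def sequence_log_prob(policy, prompt_ids, full_ids):
    """Summed log-probability of the generated tokens (those after the prompt)."""
    logits = policy(full_ids).logits[:, :-1, :]              # position t predicts token t+1
    log_probs = torch.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = torch.zeros_like(token_logp)
    mask[:, prompt_ids.shape[1] - 1:] = 1.0                  # keep only response tokens
    return (token_logp * mask).sum(dim=-1)

def remax_step(policy, reward_model, prompt_ids):
    """One ReMax-style policy-gradient step (illustrative sketch).

    policy       : a causal LM with a HuggingFace-style .generate() method
    reward_model : callable returning a scalar reward per (prompt, response) pair
    prompt_ids   : LongTensor of tokenized prompts, shape (batch, prompt_len)
    """
    # 1) Sample a response from the current policy (stochastic decoding).
    sampled = policy.generate(prompt_ids, do_sample=True, max_new_tokens=128)

    # 2) Greedily decode a second response; its reward serves as the baseline,
    #    so no separate value model needs to be trained.
    with torch.no_grad():
        greedy = policy.generate(prompt_ids, do_sample=False, max_new_tokens=128)
        r_sampled = reward_model(prompt_ids, sampled)        # trajectory-level reward
        r_greedy = reward_model(prompt_ids, greedy)          # baseline reward

    # 3) REINFORCE with a baseline: weight the log-likelihood of the sampled
    #    response by its reward advantage over the greedy response.
    advantage = r_sampled - r_greedy
    log_prob = sequence_log_prob(policy, prompt_ids, sampled)
    loss = -(advantage * log_prob).mean()
    loss.backward()
    return loss.item()
```

Real RLHF pipelines typically also add a KL penalty against the SFT model and handle attention/padding masks; the sketch omits these to isolate the variance-reduction idea.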
Advantages of ReMax
ReMax offers several significant advantages over PPO:
• Simplicity of implementation: ReMax is easier to implement, reducing engineering complexity.
• Fewer hyperparameters: ReMax eliminates four hyperparameters present in PPO, making model tuning more straightforward.
• Memory efficiency: ReMax can save about 46% GPU memory compared to PPO when training a 7B model.
• Faster training: ReMax shortens training time and does not need the memory-saving offloading techniques that PPO relies on at this scale; with offloading enabled, PPO is about 1.6 times slower than ReMax.
Experiments
We conducted experiments on 4x A800-80GB GPUs, applying ReMax to the Mistral-7B model and obtaining the following results:
• AlpacaEval leaderboard: ReMax achieved a 94.78% win rate on the AlpacaEval leaderboard, significantly outperforming current mainstream methods.
• MT-bench score: ReMax scored 7.739 on MT-bench, setting a new state of the art (SOTA) among open-source 7B models.
These results demonstrate the effectiveness of ReMax while avoiding the limitations of PPO in LLM training, pushing RLHF technology forward.
Conclusion
By streamlining the key steps of RLHF, ReMax significantly reduces memory and computational costs while improving training speed and performance. Our experiments show that ReMax is more efficient and effective than PPO for training large language models, offering a more practical option for the open-source community.