Professor Ruoyu Sun’s Team Achieves Efficient Alignment of Large Language Models - provided by Ziniu Li (2024)

 

ReMax: A Tailored Algorithm for Efficient RLHF in LLMs

 

Introduction

Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs) and is typically implemented with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general Reinforcement Learning (RL) tasks, it is more sophisticated than the RLHF setting requires, leading to significant memory and computation costs. To make RLHF more efficient, we present a tailored algorithm called ReMax.

New Techniques in ReMax

ReMax leverages three properties of RLHF that are not fully exploited in PPO: fast simulation, deterministic transitions, and trajectory-level rewards. Building on the renowned REINFORCE algorithm, ReMax optimizes the RLHF process through the following enhancements:
       1.   No need for an additional value model: Unlike PPO, ReMax does not require training an additional value model, reducing computational complexity.
       2.   New variance reduction technique: ReMax introduces a variance-reduction technique that further improves training stability and efficiency (see the sketch after this list).
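
The page does not spell out the exact form of ReMax's update, but the idea can be illustrated with a toy sketch: a REINFORCE-style policy-gradient step that needs no value model and subtracts a trajectory-level baseline (here, the reward of a greedy rollout) to reduce variance. The toy policy table, reward function, and training loop below are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only (not the authors' code): REINFORCE-style update
# with a trajectory-level baseline, in the spirit described above.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 16, 8
# Toy "policy": next-token logits conditioned on the previous token only.
logits_table = torch.randn(vocab_size, vocab_size, requires_grad=True)
optimizer = torch.optim.SGD([logits_table], lr=0.1)

def rollout(greedy):
    """Generate a toy trajectory; return its tokens and the sum of log-probs."""
    token = torch.tensor(0)           # fixed "prompt" start token
    log_prob_sum = torch.tensor(0.0)
    tokens = []
    for _ in range(seq_len):
        probs = F.softmax(logits_table[token], dim=-1)
        token = probs.argmax() if greedy else torch.multinomial(probs, 1).squeeze()
        log_prob_sum = log_prob_sum + torch.log(probs[token])
        tokens.append(int(token))
    return tokens, log_prob_sum

def reward_fn(tokens):
    """Toy trajectory-level reward (stand-in for a learned reward model)."""
    return sum(tokens) / (seq_len * (vocab_size - 1))

for step in range(200):
    sampled_tokens, log_prob = rollout(greedy=False)   # stochastic response
    with torch.no_grad():
        greedy_tokens, _ = rollout(greedy=True)        # deterministic rollout used as baseline
    advantage = reward_fn(sampled_tokens) - reward_fn(greedy_tokens)
    loss = -advantage * log_prob                        # REINFORCE step, no value model needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

In this sketch the baseline is just the reward of a second, deterministic rollout, so no extra trainable value network is required; only the trajectory-level reward and the sampled response's log-probability enter the update.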

Advantages of ReMax

ReMax offers several significant advantages over PPO:
       •   Simplicity of implementation: ReMax is easier to implement, reducing engineering complexity.
       •   Fewer hyperparameters: ReMax eliminates four hyperparameters present in PPO, making model tuning more straightforward.
       •   Memory efficiency: ReMax can save about 46% GPU memory compared to PPO when training a 7B model.
       •   Faster training: ReMax shortens training time and does not require the memory-saving offloading techniques that PPO needs; PPO with offloading is also about 1.6 times slower. (A rough illustration of where these savings come from follows this list.)
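
As a rough, non-authoritative illustration of where the memory savings can come from: a standard PPO-based RLHF pipeline keeps several large models resident during training, and the trained value model that ReMax removes is one of them. The component names below are assumptions for demonstration, not measurements from the authors' setup.

# Illustrative only: which large components each method keeps during training.
PPO_COMPONENTS = {
    "policy (actor)":   "trained, with gradients and optimizer states",
    "value (critic)":   "trained, with gradients and optimizer states",  # not needed by ReMax
    "reward model":     "frozen, inference only",
    "reference policy": "frozen, inference only",
}
REMAX_COMPONENTS = {name: role for name, role in PPO_COMPONENTS.items()
                    if name != "value (critic)"}

if __name__ == "__main__":
    print("PPO keeps:  ", list(PPO_COMPONENTS))
    print("ReMax keeps:", list(REMAX_COMPONENTS))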

 

Experiments

We conducted experiments on 4×A800-80GB GPUs, applying ReMax to the Mistral-7B model and achieving impressive results:

       •   AlpacaEval leaderboard: ReMax achieved a 94.78% win rate on the AlpacaEval leaderboard, significantly outperforming current mainstream methods.
       •   MT-bench score: ReMax scored 7.739 on MT-bench, setting a new state of the art (SOTA) for open-source 7B models.

These results demonstrate that ReMax is effective and addresses the limitations of PPO for LLMs, pushing forward the development of RLHF technology.

 

Conclusion

By optimizing the key processes of RLHF, ReMax significantly reduces memory and computational costs while improving training speed and performance. Our experimental results show that ReMax offers superior efficiency and effectiveness compared to PPO when dealing with large language models, providing a more practical option for the open-source community.

 
