Professor Pinjia He's Team Identifies and Alleviates the Refusal Position Bias in Large Language Model's Safety - provided by Youliang Yuan (2024)

The first author of this paper is Youliang Yuan, a PhD student at the School of Data Science, The Chinese University of Hong Kong, Shenzhen, advised by Professor Pinjia He. Professor He's team focuses on research in software engineering, large language models, AI for SE, and trustworthy AI.

Large language models (LLMs) have demonstrated impressive capabilities, which makes ensuring their safety essential. Research has proposed various strategies to align LLMs with human ethics and morals. However, even advanced models such as GPT-4 and LLaMA3-70B-Instruct remain vulnerable to jailbreak attacks and can be exploited for malicious purposes.

Why are these models still easily jailbroken even after extensive safety alignment? How can we further improve safety alignment?

To answer these two questions, we propose Decoupled Refusal Training (DeRTa), a simple and novel safety fine-tuning method that gives large language models the ability to "correct their mistakes," significantly improving their safety without reducing their usefulness.

    •   Paper: Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
    •   Paper link: https://arxiv.org/abs/2407.09121
    •   Open-source code: https://github.com/RobustNLP/DeRTa

We found a position bias in the safety-tuning data: the model's refusal almost always appears at the beginning of the response, which prevents the model from maintaining safety throughout the rest of the response. To address this issue, we propose two new designs:

    •   MLE with Harmful Response Prefix: This strategy prepends a random-length segment of the harmful response to a safe response, training LLMs to refuse at any position in the response instead of only at the start. In addition, the harmful prefix provides extra context for the query, significantly improving the LLMs' ability to identify and avoid unsafe content.
    •   Reinforced Transition Optimization (RTO): While a harmful prefix helps the model shift smoothly from recognizing a harmful trigger to generating a safe response, a single transition per training instance may not adequately equip LLMs to consistently recognize and prevent potential threats. To address this, we introduce an auxiliary training objective that trains the model to transition from potential harm to a safety refusal at every position within the harmful response sequence.
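The two designs above can be made concrete with a small data-construction sketch. This is only an illustration under assumptions made here, not the paper's implementation: the token lists, the loss-mask convention, and the function names are all hypothetical.

```python
import random

def mle_with_harmful_prefix(harmful_tokens, refusal_tokens):
    """MLE with Harmful Response Prefix (sketch): prepend a random-length
    slice of the harmful response to the safe refusal. The loss mask is 1
    only on the refusal tokens, so the model learns to switch to a refusal
    at any position in the response, not just at the start."""
    k = random.randint(0, len(harmful_tokens))
    prefix = harmful_tokens[:k]
    inputs = prefix + refusal_tokens
    loss_mask = [0] * len(prefix) + [1] * len(refusal_tokens)
    return inputs, loss_mask

def rto_transition_targets(harmful_tokens, refusal_start_token):
    """Reinforced Transition Optimization (sketch): for every position in
    the harmful response, add an auxiliary target predicting the first
    refusal token as the next token, so the harmful-to-safe transition is
    reinforced everywhere rather than once per training instance."""
    return [(harmful_tokens[:i], refusal_start_token)
            for i in range(1, len(harmful_tokens) + 1)]
```

In the actual method these examples would be mixed into supervised fine-tuning batches; here they serve only to make the two objectives explicit.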

Together, these designs comprehensively strengthen the model's defense mechanisms, allowing it to learn how to "find its way back" when it makes errors. The approach has also sparked some discussion on Twitter.

We conducted experiments on the well-known models LLaMA3 (8B & 70B) and Mistral (7B & 8×7B) against six different jailbreak attack methods. The results showed:

  • DeRTa significantly improved safety without reducing usefulness.
  • DeRTa can further enhance the safety of LLaMA3-70B-Instruct.
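A common way to quantify results like these is a keyword-based attack success rate (ASR): an attack counts as successful when the response contains no refusal phrase. The sketch below is a generic version of that metric, not the paper's exact evaluation; the marker list is an assumption.

```python
# Hypothetical refusal markers; real evaluations use longer lists
# or a classifier model instead of keyword matching.
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry", "I am sorry")

def attack_success_rate(responses, markers=REFUSAL_MARKERS):
    """Fraction of responses containing no refusal marker, i.e. where the
    jailbreak attempt apparently succeeded (keyword-based ASR)."""
    refused = sum(any(m in r for m in markers) for r in responses)
    return 1 - refused / len(responses)
```

On a toy pair such as `["I'm sorry, but I can't help with that.", "Sure, here is how: ..."]` this returns 0.5, since only the second response lacks a refusal.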

Then, we took a closer look at DeRTa and found:

  • DeRTa gives the model the ability to correct itself: even if the model starts to output unsafe text, it can effectively switch back to a safe state (see Table 3 and Figure 4).
  • Maximum likelihood estimation (MLE) with harmful prefixes alone is not enough to handle all types of attacks; RTO is crucial for giving the model the ability to reject unsafe text at any point.

Finally, a comparison with DPO further confirmed that DeRTa's safety improvements are not merely due to using harmful response information. Additionally, the method works well for models of different sizes.

Conclusion

Ensuring the safety of large models remains a significant challenge, and moving beyond superficial safety alignment to achieve in-depth safety is a difficult task. We share some of our explorations and thoughts here, aiming to provide useful insights and a baseline method for future research in this area.

Copyright © CUHK-Shenzhen The Information Technology Services Office