The first author of this paper is Yuan Youliang, a PhD student at the School of Data Science, The Chinese University of Hong Kong, Shenzhen, advised by Professor He Pinjia. Professor He Pinjia's team focuses on research in software engineering, large language models, AI for SE, and trustworthy AI.
Large language models (LLMs) have demonstrated impressive capabilities, making it critical to ensure their safety. Research has proposed various strategies to align LLMs with human ethics and values. However, even advanced models such as GPT-4 and LLaMA3-70B-Instruct remain vulnerable to jailbreak attacks and can be exploited for malicious purposes.
Why are these models still easily jailbroken even after extensive safety alignment? How can we further improve safety alignment?
To answer these two questions, we propose Decoupled Refusal Training (DeRTa), a simple and novel safety fine-tuning method that gives large language models the ability to "correct their mistakes," significantly improving their safety without reducing their usefulness.
• Paper: Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
• Paper link: https://arxiv.org/abs/2407.09121
• Open-source code: https://github.com/RobustNLP/DeRTa
We found a position bias in safety-tuning data: the model's refusal almost always appears at the beginning of the response, which prevents the model from staying safe once a response is underway. To address this, we propose two new designs:
• MLE with Harmful Response Prefix: This strategy appends a random-length segment of the harmful response to the beginning of a safe response, training LLMs to refuse compliance at any position in the response instead of only at the start. In addition, the harmful prefix provides extra context for the query, significantly improving the LLM's ability to identify and avoid unsafe content.
• Reinforced Transition Optimization (RTO): While incorporating a harmful prefix helps the model shift smoothly from recognizing a harmful trigger to generating a safe response, relying on a single transition per training instance may not adequately equip LLMs to consistently recognize and prevent potential threats. To address this, we introduce an auxiliary training objective that learns the transition from potential harm to safety refusal at every position within the harmful response sequence.
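The two objectives above can be viewed as a data-construction step that produces (input, label) pairs for an ordinary causal-LM cross-entropy loss. Below is a minimal Python sketch under stated assumptions: sequences are token-id lists, the `-100` ignore-index convention of common trainers masks unsupervised positions, and `SORRY_ID` stands in for the id of a refusal trigger token; the function name and constants are illustrative, not the paper's actual code.

```python
import random

IGNORE = -100     # loss-mask value conventionally skipped by cross-entropy
SORRY_ID = 9999   # hypothetical token id for the refusal trigger (e.g. "Sorry")

def build_derta_example(query, harmful, safe, rng=random):
    """Construct label pairs for the two DeRTa training objectives.

    query, harmful, safe: lists of token ids.
    Returns (mle, rto), each an (input_ids, labels) pair.
    """
    # Objective 1: MLE with a harmful response prefix.
    # A random-length prefix of the harmful response follows the query;
    # only the safe-response tokens are supervised, so the model learns
    # to switch to a refusal at an arbitrary position.
    k = rng.randrange(len(harmful) + 1)          # prefix length in [0, len(harmful)]
    prefix = harmful[:k]
    mle_input = query + prefix + safe
    mle_labels = [IGNORE] * (len(query) + len(prefix)) + list(safe)

    # Objective 2: Reinforced Transition Optimization (RTO).
    # Over the full harmful response, the target at every position is the
    # refusal token, reinforcing the harmful-to-refusal transition everywhere.
    rto_input = query + harmful
    rto_labels = [IGNORE] * len(query) + [SORRY_ID] * len(harmful)

    return (mle_input, mle_labels), (rto_input, rto_labels)
```

In a real trainer the labels would additionally be shifted by one position for next-token prediction; that detail is omitted here for clarity.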
The design mentioned above ensures a comprehensive enhancement of the model's defense mechanisms, allowing the model to learn how to "find its way back" when it makes errors. This approach has also sparked some discussion on Twitter.
We conducted experiments on well-known models LLaMA3 (8B & 70B) and Mistral (7B & 8×7B), using six different jailbreak attack methods. The results showed:
- DeRTa significantly improved safety without reducing usefulness.
- DeRTa can further enhance the safety of LLaMA3-70B-Instruct.
Then, we took a closer look at DeRTa and found:
- DeRTa gives the model the ability to correct itself: even if the model begins to output unsafe text, it can effectively switch back to a safe state (see Table 3 and Figure 4 for reference).
- Using only maximum likelihood estimation (MLE) with harmful prefixes is not enough to defend against all attack types; RTO is crucial for enabling the model to reject unsafe content at any position.
Finally, by comparing with DPO, we further confirmed that the safety improvements brought by DeRTa were not just due to using harmful response information. Additionally, this method works well for models of different sizes.
Conclusion
Ensuring the safety of large models remains a significant challenge, and moving beyond superficial safety alignment to achieve deep, robust safety is difficult. We have shared some of our explorations and thoughts here, hoping to provide useful insights and a baseline method for future research in this area.