Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a crucial technology for aligning language models with human values and intentions, enabling models to produce more helpful and harmless responses. Reward models are trained as proxies for human preferences to drive reinforcement learning optimization. While reward models are often considered central to achieving high performance, they face the following challenges in practical applications: (1) Incorrect and ambiguous preference pairs in the dataset may hinder the reward model from accurately capturing human intent. (2) Reward models trained on data from a specific distribution often struggle to generalize to examples outside that distribution and are not suitable for iterative RLHF training. In this report, we attempt to address these two issues. (1) From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism of multiple reward models. Experimental results confirm that data with varying preference strengths have different impacts on reward model performance. We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset and fully leverage high-quality preference data. (2) From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization. Furthermore, we employ meta-learning to enable the reward model to maintain the ability to differentiate subtle differences in out-of-distribution samples, and this approach can be utilized for iterative RLHF optimization.
SOURCE: https://lnkd.in/gZNHMY8s CODE: https://lnkd.in/gAD26D65
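To make the voting idea concrete, here is a minimal sketch of how preference strength could be scored with an ensemble of reward models: the mean reward margin across models indicates how strongly a pair is preferred, and the spread flags ambiguous or mislabeled pairs. The `score` interface and function names are illustrative assumptions, not the paper's actual code (see the CODE link above for that).

```python
import statistics

def preference_strength(reward_models, prompt, chosen, rejected):
    """Score one preference pair with an ensemble of reward models.

    The mean margin approximates how strongly the ensemble prefers
    `chosen` over `rejected`; the standard deviation flags ambiguity.
    (Illustrative sketch: `score` is an assumed interface.)
    """
    margins = [rm.score(prompt, chosen) - rm.score(prompt, rejected)
               for rm in reward_models]
    mean = statistics.mean(margins)
    std = statistics.stdev(margins) if len(margins) > 1 else 0.0
    return mean, std

def filter_pairs(pairs, reward_models, min_strength=0.0):
    """Drop pairs the ensemble votes against (likely incorrect labels)
    and sort the rest so the clearest preferences come first."""
    scored = [(pair, *preference_strength(reward_models, *pair)) for pair in pairs]
    kept = [entry for entry in scored if entry[1] > min_strength]
    return sorted(kept, key=lambda entry: entry[1], reverse=True)
```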
RLHF Approaches for AI Alignment
Summary
Reinforcement Learning from Human Feedback (RLHF) is a method to improve AI behavior by aligning machine learning models with human preferences, making them more adaptable and reliable. Recent developments in RLHF approaches emphasize refining data quality, generalizing models to diverse scenarios, and exploring alternative algorithms for better alignment.
- Focus on quality data: Use methods like preference voting to prioritize high-quality data and reduce the impact of ambiguous or inaccurate preferences.
- Utilize advanced algorithms: Combine techniques such as contrastive learning, meta-learning, or iterative methods to enhance model generalization and performance (a minimal loss sketch follows this list).
- Blend RLHF strategies: Consider integrating both reward-based (PPO) and reward-free (DPO) methods for comprehensive AI alignment, depending on the dataset and task requirements.
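As a concrete reference for the loss-level ideas above, the following is a minimal PyTorch sketch of a pairwise reward-model ranking loss with an optional contrastive-style margin. It is a generic Bradley-Terry formulation, not the exact objective used in any of the papers summarized here.

```python
import torch
import torch.nn.functional as F

def ranking_loss_with_margin(chosen_rewards, rejected_rewards, margin=0.0):
    """Pairwise Bradley-Terry loss commonly used for reward modeling,
    with an optional margin that pushes chosen and rejected scores
    further apart (a contrastive-style variant; the papers above use
    more elaborate objectives).

    chosen_rewards, rejected_rewards: 1-D tensors of scalar rewards the
    reward model assigns to the preferred / dispreferred response of
    each pair in the batch.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margin).mean()

# Example usage with dummy reward scores for a batch of 4 pairs:
chosen = torch.tensor([1.2, 0.3, 2.0, 0.9])
rejected = torch.tensor([0.4, 0.5, 1.1, -0.2])
loss = ranking_loss_with_margin(chosen, rejected, margin=0.1)
```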
RLHF Workflow: From Reward Modeling to Online RLHF
We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.
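The loop below is a schematic of that recipe: generate on-policy responses, label pairs with the proxy preference model in place of human feedback, then run a preference-optimization step on the fresh pairs. It assumes a hypothetical policy with a `generate` method, a proxy preference model with a `score` method, and a pluggable update function; the report's released code is the authoritative implementation.

```python
def online_iterative_rlhf(policy, proxy_preference_model, prompts,
                          preference_update, n_iterations=3, n_samples=8):
    """Schematic of online iterative RLHF with a proxy preference model.

    `policy.generate(prompt)`, `proxy_preference_model.score(prompt, response)`,
    and `preference_update(policy, pairs)` (e.g. one DPO epoch) are all
    placeholder interfaces for this sketch, not the report's actual APIs.
    """
    for _ in range(n_iterations):
        # 1. On-policy data collection: several candidate responses per prompt.
        candidates = {p: [policy.generate(p) for _ in range(n_samples)] for p in prompts}

        # 2. Proxy feedback: rank candidates with the preference model and keep
        #    the best/worst response per prompt as a (prompt, chosen, rejected) pair.
        pairs = []
        for prompt, responses in candidates.items():
            ranked = sorted(responses, key=lambda r: proxy_preference_model.score(prompt, r))
            pairs.append((prompt, ranked[-1], ranked[0]))

        # 3. Policy update on the freshly labeled pairs (e.g. iterative DPO).
        policy = preference_update(policy, pairs)

    return policy
```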
While I was eagerly awaiting the technical report/paper accompanying the Llama 3 release yesterday, I stumbled upon another very interesting research paper this week, which finally answers one of my pressing questions: "Is DPO Superior to PPO for LLM Alignment?"
RLHF is one of the main pillars of the LLM training lifecycle: pretraining -> supervised instruction-finetuning -> RLHF. In recent months, the reward-free DPO method has become one of the most widely used alternatives to the reward-based RLHF with PPO because it doesn't require training a separate reward model and is thus easier to use and implement. Consequently, most of the LLMs on top of public leaderboards have been trained with DPO rather than PPO. Unfortunately, there have not been any direct head-to-head comparisons where the same model was trained with either PPO or DPO using the same dataset until this paper came along.
It's a well-written paper with lots of experiments and results, but the main takeaways are that PPO is generally better than DPO, and DPO suffers more heavily from out-of-distribution data. If you use DPO, make sure to perform supervised finetuning on the preference data first. Also, iterative DPO, which involves labeling additional data with an existing reward model, is better than DPO on the existing preference data. For PPO, the key success factors are large batch sizes, advantage normalization, and parameter updates via an exponential moving average.
Fun fact: Based on what we know from the Llama 3 blog post, it doesn't have to be RLHF with either PPO or DPO. Llama 3 was trained with both! Pretraining -> supervised finetuning -> rejection sampling -> PPO -> DPO.
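For concreteness, here is a small PyTorch sketch of the two sides of that comparison: the reward-free DPO objective, plus two of the PPO success factors the paper highlights (per-batch advantage normalization and an exponential moving average of the policy weights). Function names and the `beta` default are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Reward-free DPO objective on per-sequence log-probabilities:
    raise the policy/reference log-ratio of the chosen response relative
    to the rejected one; beta controls the implicit KL penalty."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

def normalize_advantages(advantages, eps=1e-8):
    """Per-batch advantage normalization, one of the PPO success factors
    named above: zero mean, unit variance within each (large) batch."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

@torch.no_grad()
def ema_update(ema_model, policy_model, decay=0.995):
    """Exponential moving average of policy parameters, another reported
    PPO success factor; the EMA copy is the one kept and evaluated."""
    for ema_p, p in zip(ema_model.parameters(), policy_model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```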