Capabilities of large language models (LLMs) have progressed rapidly, but the key ideas behind LLMs have their roots in research from years before! In fact, the training pipeline used by LLMs (pretraining -> SFT -> RLHF) was originally proposed in 2020 for training abstractive summarization models.

Supervised learning for summarization: Prior to the popularization of LLMs, most summarization models were trained in a supervised fashion where we:
1. Perform self-supervised pretraining.
2. Perform supervised finetuning (SFT) over summary examples.
Humans don't always write perfect summaries, and summarization is a very open-ended task (i.e., one document can be summarized in many different ways that are equally valid). As such, training a summarization model to exactly match a human-written summary is a flawed approach!

Learning from human feedback: Motivated by these issues, the authors of [1] propose a three-part framework that enables LLMs to be finetuned on human preference data. The LLM is first pretrained and then finetuned with SFT. Afterwards, we collect data for reinforcement learning from human feedback (RLHF) as follows:
1. Take a textual input (source document) from the training dataset.
2. Use several policies (e.g., the pretrained model, the supervised baseline, the current model, or the human reference summary) to sample summaries of the input.
3. Pick two summaries from the set of sampled responses.
4. Ask a human annotator to identify the better of the two summaries.
Once the data has been collected, we use this comparison data to train a reward model (using a pairwise ranking loss, sketched in code at the end of this post) that accurately predicts a human preference score for a summary produced by the LLM. From here, we use PPO (an RL algorithm) to finetune the model based on the preference scores output by the reward model.

Avoiding drift: The authors of [1] add a KL divergence term to the objective optimized by PPO, which penalizes the policy for drifting too far from the supervised baseline policy during RLHF. This approach, which is now commonly used (e.g., see Eq. 4 in the LLaMA-2 report), encourages exploration without mode collapse and prevents summaries written by the LLM from becoming too different from those seen during training.

Is learning from feedback effective? After training 1.3B and 6.7B GPT-style LLMs with this approach, we see that the 1.3B human feedback model outperforms a 10× larger model trained with supervised learning alone, and the 6.7B human feedback model performs even better than the 1.3B model. These models are consistently preferred to those trained via SFT alone.

From summarization to ChatGPT: The approach proposed in [1] is not just useful for summarization! This strategy for learning from human feedback was extended by InstructGPT [2] (the sister model and predecessor of ChatGPT) to align foundation language models. The training strategy used in [2] is nearly identical to the training framework proposed in [1]!
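To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the pairwise ranking loss idea described in [1]: the human-preferred summary should receive a higher scalar reward than the rejected one. The `RewardModel` class, the `pairwise_ranking_loss` helper, and the dummy 768-dimensional encodings are illustrative assumptions for this sketch, not the actual architecture or API from the paper.

```python
# Minimal sketch (assumed names, not from [1] or [2]) of training a reward
# model on human comparison data with a pairwise ranking loss.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy reward model: maps an encoded (prompt, summary) pair to a scalar score."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # In practice the scoring head sits on top of an LLM backbone;
        # a single linear layer stands in for that backbone here.
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        return self.scorer(encoded).squeeze(-1)  # shape: (batch,)


def pairwise_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch.

    Minimizing this pushes the reward of the preferred summary above
    the reward of the rejected one.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()


# Dummy usage with random encodings standing in for LLM hidden states.
reward_model = RewardModel()
chosen_enc = torch.randn(4, 768)    # encodings of the human-preferred summaries
rejected_enc = torch.randn(4, 768)  # encodings of the rejected summaries
loss = pairwise_ranking_loss(reward_model(chosen_enc), reward_model(rejected_enc))
loss.backward()
```

In [1] the reward model is initialized from the supervised baseline rather than trained from scratch; the sketch above only illustrates the shape of the comparison-based loss.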
Boosting Large Language Model Results with Human Comparison Data
Explore top LinkedIn content from expert professionals.
Summary
Boosting large language model results with human comparison data refers to the process of improving how AI models understand and respond to human needs by training them with feedback from people who compare different model outputs. This approach makes language models more helpful, honest, and aligned with real-world preferences by using human judgments to guide their learning.
- Collect comparison data: Gather examples where humans rank or choose between different model responses to capture real preferences in training (a minimal record format is sketched after this list).
- Train with feedback: Use these human comparisons to develop a reward system that guides the model toward producing outputs people prefer.
- Balance and refine: Apply regular adjustments during training to keep the model’s responses consistent and prevent them from drifting away from familiar, reliable patterns.
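Concretely, "comparison data" in this setting can be as simple as (prompt, chosen, rejected) triples. Below is a minimal, hypothetical record format; the field names are assumptions for illustration and are not taken from either paper.

```python
# Hypothetical schema for one human comparison record used in reward-model
# training. Field names are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class ComparisonRecord:
    prompt: str             # the input document or instruction
    chosen: str             # the response the annotator preferred
    rejected: str           # the response the annotator ranked lower
    annotator_id: str = ""  # optional provenance for quality checks


record = ComparisonRecord(
    prompt="Summarize: The city council voted on Tuesday to ...",
    chosen="The council approved the new transit budget on Tuesday.",
    rejected="Something happened at a meeting.",
)
```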
If you are wondering how RLHF works, and how we can teach large language models to be helpful, harmless, and honest, read along 👇

The key isn't just in scaling up model size, it's in aligning models with human intent. The InstructGPT paper (2022) popularized a three-step process called Reinforcement Learning from Human Feedback (RLHF), and even today it remains the foundation of how we build instruction-following models like ChatGPT. Let me walk you through the workflow in plain terms, based on the now-famous diagram from the paper 👇

1. Supervised Fine-Tuning (SFT)
→ Start by showing the model examples of great answers to real prompts, written by humans.
→ These examples help the model learn how to respond: clear, direct, and grounded.
→ Think of this as training a junior writer by giving them a stack of perfect first drafts.
→ Even with a small dataset (13k samples), this creates a solid instruction-following base.

2. Reward Model (RM)
→ Next, we collect several outputs for the same prompt and ask humans to rank them from best to worst.
→ We then train a separate model, the reward model, to predict those rankings.
→ Now we've turned human preferences into a numerical score the model can optimize for.
→ This is the real magic: turning subjective feedback into something that can guide learning.

3. Reinforcement Learning (PPO)
→ Now the model generates new answers, gets scored by the reward model, and adjusts its behavior to maximize reward.
→ We use Proximal Policy Optimization (PPO), an RL algorithm that gently nudges the model in the right direction without making it forget what it already knows.
→ A "KL penalty" acts like a seatbelt, keeping the model from straying too far from its SFT behavior (see the sketch after this post).

Why it works❓
✅ A small 1.3B model trained with this pipeline outperformed GPT-3 (175B) in human evaluations.
✅ It generalized to unseen domains with little extra supervision.
✅ And it required orders of magnitude less data than pre-training.

What this means for builders❓
→ Bigger isn't always better. Better feedback leads to better behavior.
→ Pairwise comparisons are often more scalable than manual ratings.
→ RLHF lets us teach models values, not just vocabulary.

If you're building AI systems, aligning them with human preferences isn't just a safety concern, it's a product strategy.

Share this with your network ♻️ Follow me (Aishwarya Srinivasan) for more AI insights.
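For readers who want to see the KL penalty from step 3 as code, here is a minimal sketch of a KL-shaped reward: the reward-model score minus a penalty proportional to how far the current policy's log-probabilities drift from the frozen SFT model's. The function name, tensor shapes, and the beta value are illustrative assumptions, not values from the InstructGPT paper.

```python
# Minimal sketch (assumed names and shapes) of the KL-penalized reward that
# PPO maximizes during RLHF: reward-model score minus a drift penalty.

import torch


def kl_penalized_reward(
    rm_score: torch.Tensor,         # (batch,) reward-model score per sampled response
    logprobs_policy: torch.Tensor,  # (batch, seq) per-token log-probs under the current policy
    logprobs_sft: torch.Tensor,     # (batch, seq) per-token log-probs under the frozen SFT model
    beta: float = 0.02,             # KL penalty strength (illustrative value, not from the paper)
) -> torch.Tensor:
    """reward = rm_score - beta * sum_t (log pi(a_t|s_t) - log pi_SFT(a_t|s_t))."""
    approx_kl = (logprobs_policy - logprobs_sft).sum(dim=-1)  # per-sequence KL estimate
    return rm_score - beta * approx_kl


# Dummy usage: 4 sampled responses of length 16.
rewards = kl_penalized_reward(
    rm_score=torch.randn(4),
    logprobs_policy=torch.randn(4, 16),
    logprobs_sft=torch.randn(4, 16),
)
print(rewards.shape)  # torch.Size([4])
```

Because the penalty grows as the policy's token distribution moves away from the SFT model, it acts as the "seatbelt" described above: the model can explore better answers without forgetting how it was trained to write.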