Talk

Understanding and Overcoming Pitfalls in Language Model Alignment

  • Noam Razin (Princeton University)

Abstract

Training safe and helpful language models requires aligning them with human preferences. In this talk, I will present theory and experiments highlighting pitfalls of the two most widely adopted approaches: Reinforcement Learning from Human Feedback (RLHF), which trains a reward model based on preference data and then maximizes this reward via RL, and Direct Preference Optimization (DPO), which directly trains the language model over preference data. As detailed below, beyond diagnosing these pitfalls, I will provide quantitative measures for identifying when they occur and suggest preventative guidelines.
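For readers who want the two objectives in symbols, here is a minimal sketch in generic notation (the notation below is mine, not necessarily that of the talk or the papers): RLHF maximizes a learned reward $r_\phi$ under a KL penalty toward a reference model $\pi_{\mathrm{ref}}$, while DPO trains the language model $\pi_\theta$ directly on preference pairs $(x, y^+, y^-)$:

$$\max_{\theta} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \big],$$

$$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\, \mathbb{E}_{(x,\, y^+,\, y^-) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} \;-\; \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right) \right],$$

where $\beta > 0$ is a regularization coefficient, $\sigma$ is the sigmoid function, and $y^+, y^-$ denote the preferred and dispreferred responses in a preference pair.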

First, I will show that regardless of how accurate a reward model is, if it induces low reward variance, then RLHF suffers from a flat objective landscape that hinders optimization. This implies that more accurate reward models are not necessarily better teachers, contrary to conventional wisdom, and reveals fundamental limitations of existing reward model benchmarks. Furthermore, I will present practical applications of the connection between reward variance and optimization (e.g., the design of data selection and policy gradient methods) and discuss how different reward model parameterizations affect generalization. Then, we will focus on likelihood displacement: a counterintuitive tendency of DPO to decrease the probability of preferred responses instead of increasing it as intended. I will characterize the mechanisms driving likelihood displacement and demonstrate that it can unintentionally lead to unalignment by shifting probability mass from preferred safe responses to harmful responses. Our analysis yields a data filtering method that mitigates the undesirable outcomes of likelihood displacement.
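The two quantities at the center of these results can be sketched in the same generic notation (an illustrative paraphrase, not the precise definitions used in the works below): the reward variance induced by $r_\phi$ at a prompt $x$ is the variance of the reward over responses drawn from the current policy, and likelihood displacement means that the log-probability of the preferred response decreases over the course of DPO training:

$$\mathrm{Var}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r_\phi(x, y) \big] \;\; \text{small} \;\;\Longrightarrow\;\; \text{a nearly flat RLHF objective around } \pi_\theta,$$

$$\log \pi_{\theta_t}(y^+ \mid x) \;<\; \log \pi_{\theta_0}(y^+ \mid x) \quad \text{for a preferred response } y^+ \text{ after } t \text{ steps of DPO},$$

where $\theta_0$ denotes the initialization and $\theta_t$ the model after $t$ training steps.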

This talk is based on the following works.

1. Vanishing Gradients in Reinforcement Finetuning of Language Models, coauthored with Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin (ICLR 2024).

2. What Makes a Reward Model a Good Teacher? An Optimization Perspective, coauthored with Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, and Sanjeev Arora (arXiv preprint 2025).

3. Why is Your Language Model a Poor Implicit Reward Model?, coauthored with Yong Lin, Jiarui Yao, and Sanjeev Arora (arXiv preprint 2025).

4. Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization, coauthored with Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, and Boris Hanin (ICLR 2025).



Math Machine Learning seminar MPI MIS + UCLA

MPI for Mathematics in the Sciences (Live Stream)
