Direct Preference Optimization

In this assignment, we will train a GPT-2 model to generate text with positive sentiment using Direct Preference Optimization (DPO). More specifically, we are replicating a small subset of an experiment from Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

Prerequisites

Before proceeding, you should familiarize yourself with Direct Preference Optimization. You should read the Direct Preference Optimization: Your Language Model is Secretly a Reward Model paper and get familiar with the TRL library, especially the DPOTrainer class.

You could reuse your code from the SFT assignment, or use SFTTrainer from the TRL library to perform SFT on models.

Model and Dataset

We will be using GPT2-Large as the model to be trained.

The training dataset is stanfordnlp/imdb, which contains 50,000 movie reviews from the Internet Movie Database. It contains 25,000 labeled reviews (12,500 positive and 12,500 negative) for training and 25,000 labeled reviews for testing.

You will use the training subset to perform SFT to build a reference policy model. Then, you will generate synthetic preference pairs by sampling multiple continuations for review prompts and ranking them with a pretrained sentiment classifier (siebert/sentiment-roberta-large-english). These preference pairs will be used for DPO training.

The task

Your task is to train a model using Direct Preference Optimization to generate text with positive sentiment.

Part 1: Supervised Fine-Tuning

First of all, we need to teach our model to generate higher-quality movie reviews.

Fine-tune GPT2-Large on the full IMDB training set (stanfordnlp/imdb, split="train") for one epoch.

Record your training parameters and metrics in your report.

This model serves as both the reference model ($\pi_\text{ref}$) and the initialization of the policy model ($\pi_\theta$) for DPO training.

Part 2: Synthetic Preference Pairs

You will generate synthetic preference pairs by sampling multiple completions for review prompts in the training set of stanfordnlp/imdb and ranking them using a pretrained sentiment classifier (siebert/sentiment-roberta-large-english).

The paper sampled 25000 prefixes from the training set and generated 4 completions for each prefix. For the sake of keeping the training time reasonable, you should sample 1000 prefixes from the training set and generate 4 completions for each prefix.

For each sampled example, truncate the original review to a prefix of 2-8 tokens and generate 4 completions from that prefix.
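As a concrete sketch, prefix sampling could look like the following. `sample_prefix` is a hypothetical helper (not part of any library) that operates on an already-tokenized review; dummy integer IDs stand in for real token IDs from your tokenizer:

```python
import random

def sample_prefix(token_ids, min_len=2, max_len=8, rng=None):
    """Truncate a tokenized review to a random prefix of 2-8 tokens.

    `token_ids` is the tokenizer output for one review; the prefix
    length is capped by the review's actual length.
    """
    rng = rng or random.Random(0)
    length = rng.randint(min_len, min(max_len, len(token_ids)))
    return token_ids[:length]

# Dummy IDs standing in for a tokenized review:
prefix = sample_prefix(list(range(20)))
```

You would then decode the prefix back to text (or feed the IDs directly) and call your SFT model's generation method 4 times per prefix.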

You should then rank the completions using the sentiment classifier, and use the scores to generate preference pairs. There are 4 completions for each prompt, so you can form 6 ordered (chosen, rejected) pairs from these completions (without considering ties).
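One way to form the pairs is to compare every unordered pair of completions by classifier score and let the higher-scoring one be "chosen". `make_preference_pairs` below is a hypothetical helper, shown with placeholder completions and scores rather than real model outputs:

```python
from itertools import combinations

def make_preference_pairs(prompt, completions, scores):
    """Turn scored completions into (chosen, rejected) preference pairs.

    `scores` are the classifier's positive-sentiment probabilities.
    Ties are skipped, so 4 completions yield up to C(4, 2) = 6 pairs.
    """
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        if scores[i] == scores[j]:
            continue  # skip ties
        chosen, rejected = (i, j) if scores[i] > scores[j] else (j, i)
        pairs.append({
            "prompt": prompt,
            "chosen": completions[chosen],
            "rejected": completions[rejected],
        })
    return pairs

# Placeholder data: one prompt with 4 completions and their scores.
pairs = make_preference_pairs(
    "The movie", ["a", "b", "c", "d"], [0.9, 0.1, 0.7, 0.3]
)
```

Each returned dict already matches the format required below, so the output can be written directly to a dataset for DPO training.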

The preference pairs should be in the following format:

{
    "prompt": "...",
    "chosen": "...",
    "rejected": "..."
}

Part 3: Direct Preference Optimization

Train a DPO model initialized from your SFT checkpoint.

You are only required to run DPO training three times: once for each beta in {0.05, 0.1, 1.0}, with learning rate 2e-5, for 1 epoch each. Feel free to try other learning rates and beta values, but please keep the total GPU time for this assignment reasonable (e.g. under 10 hours).

Log DPO loss, KL divergence, and other metrics of your choice in wandb.

Part 4: Evaluation

You will evaluate the performance of your DPO-trained models and reproduce the Reward vs. KL trade-off plot from the Direct Preference Optimization paper (Figure 2 Left, “IMDb Sentiment Generation”).

Step 0. Setup

  1. Load the reference SFT model ($\pi_\text{ref}$) and the DPO model ($\pi_\theta$).
  2. Load the sentiment classifier (siebert/sentiment-roberta-large-english).
  3. Load 50 prompts from the IMDb test split.

Step 1. Compute Sentiment Reward

  1. Generate 5 completions per test prompt from the reference SFT model ($\pi_\text{ref}$) and the DPO model ($\pi_\theta$).
  2. Score each completion using your pretrained sentiment classifier (siebert/sentiment-roberta-large-english).
  3. Compute the average positive sentiment probability across completions for each model. This value is your reward.

Return the mean reward across test prompts.
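Assuming you have already collected the classifier's positive-class probabilities for every generated completion, the aggregation itself is simple. `mean_sentiment_reward` is a hypothetical helper, shown with placeholder scores:

```python
def mean_sentiment_reward(scores_per_prompt):
    """Average positive-sentiment probability: first within each
    prompt's completions, then across prompts."""
    per_prompt = [sum(s) / len(s) for s in scores_per_prompt]
    return sum(per_prompt) / len(per_prompt)

# Placeholder: two prompts, each with classifier scores for its completions.
reward = mean_sentiment_reward([[0.9, 0.7], [0.5, 0.5, 0.8]])
```

In your pipeline, each inner list would hold the 5 completion scores for one of the 50 test prompts, computed once for the SFT model and once per DPO checkpoint.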

Step 2. Compute KL Divergence

Estimate the forward KL divergence between the DPO policy ($\pi_\theta$) and the reference ($\pi_{\text{ref}}$) (your SFT model):

\[\mathrm{KL}(\pi_\theta \Vert \pi_{\text{ref}}) = \mathbb{E}_{x,y\sim\pi_\theta}[\log\pi_\theta(y|x) - \log\pi_{\text{ref}}(y|x)].\]

You can approximate this numerically using the difference in log-probabilities output by the models for the generated tokens.

Return the mean KL divergence across test prompts.
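With per-token log-probabilities already extracted from both models for each sampled completion, a minimal sketch of the estimator (function names are illustrative) is:

```python
def sequence_kl(logp_theta, logp_ref):
    """Monte Carlo KL estimate for one sampled completion: the sum of
    per-token differences log pi_theta(y_t|x) - log pi_ref(y_t|x)."""
    return sum(t - r for t, r in zip(logp_theta, logp_ref))

def mean_kl(all_theta, all_ref):
    """Average the per-sequence estimates across test prompts."""
    kls = [sequence_kl(t, r) for t, r in zip(all_theta, all_ref)]
    return sum(kls) / len(kls)

# Placeholder log-probs for a single two-token completion:
kl = mean_kl([[-1.0, -2.0]], [[-1.5, -2.5]])
```

Note that this estimate is taken over completions sampled from $\pi_\theta$, matching the expectation in the formula above; each inner list should contain the log-probability each model assigns to the tokens the DPO policy actually generated.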

Step 3. Plot Reward vs KL

Using your evaluation results, create a scatter plot replicating the style of the paper:

  • x-axis: $\mathrm{KL}(\pi_\theta \Vert \pi_{\text{ref}})$ – the mean KL divergence across test prompts
  • y-axis: Sentiment Reward (average classifier positive probability) – the mean reward across test prompts
  • Each point: One trained checkpoint or hyperparameter setting – one DPO model checkpoint for each beta value you trained.

Ideally, your plot should show an increasing trend (as shown in Figure 2 in the paper): as KL grows, reward initially rises sharply, then plateaus at around 1.0, illustrating stronger sentiment control with moderate divergence from the reference model.

We will also accept plots in which all of the points plateau at a reward of around 1.0. For an example, see the plot below.

Example Plot

The whiskered crosses represent standard deviation and are not required for your implementation.

It is also acceptable if your plot does not display this trend. In that case, you should discuss possible reasons for the discrepancy between your plot and the paper.

Part 5: Report

You are to prepare a short report (report.pdf) that summarizes your findings.

There is no definite format for the report, but you should include the following information:

  • The organization of your code, i.e. which files are used for which part of the task.
  • The link to your wandb project which should include stats for all your runs, or plots of your recorded metrics in wandb. Don’t forget to make the project public!
  • The Reward vs KL plot you created. Discuss the trend you observed in the plot.
    • If the trend is the same as the paper, discuss how your results compare to the paper in terms of the reward and KL divergence.
    • If the trend in your plot seems to differ from the paper, try to discuss why.
  • The training parameters you used for both SFT and DPO.
  • Discuss the results of your DPO training, and include sample generations.

Generative AI Guidance

In this assignment, you are encouraged to use Generative AI to help you with the coding part of the homework. However, you are not allowed to use it to write the report.

Submission

You should submit your code and report to the Gradescope assignment.