Supervised Fine-Tuning: Improving an LLM’s Racket Capabilities with MBPP Executions

In this assignment, we will teach a small model to write Racket programs better.

Prerequisites

Although some of you may already know a subset of Racket, you are not expected to have any Racket experience. However, you are expected to be familiar with running Racket programs on the command line and checking their results. See below for a primer.

Running Racket Programs

Say you give this prompt (i.e. prefix) to a model to let it generate Racket code:

#lang racket 
;; Write a program that adds two numbers a and b. 
;; a and b are entered on the same line in stdin, separated by a space.
;; print the sum to stdout.

and the model produces the following completion:

(define a (read))
(define b (read))
(displayln (+ a b))

What you need to do is combine the prompt and the completion into a temporary Racket program file (say, /tmp/out.rkt) and run it using

racket /tmp/out.rkt

Check The Racket Command Line Guide for more details.
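The steps above can be sketched in a small Python harness. This is an illustrative sketch, not required code; the helper names, the stdin value, and the timeout are all assumptions you can change:

```python
import subprocess
import tempfile

def combine(prompt: str, completion: str) -> str:
    """Concatenate the prompt and the model's completion into one program."""
    return prompt.rstrip("\n") + "\n" + completion

def run_racket(program: str, stdin_text: str = "", timeout: float = 10.0) -> str:
    """Write the program to a temporary .rkt file, run it with racket,
    and return whatever it printed to stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".rkt", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run(
        ["racket", path],
        input=stdin_text,        # the program reads its inputs from stdin
        capture_output=True,
        text=True,
        timeout=timeout,         # guard against non-terminating programs
    )
    return result.stdout
```

For the example above, `run_racket(combine(prompt, completion), "3 4\n")` would be expected to print the sum of the two numbers. The timeout matters: generated programs can loop forever, and an unguarded evaluation run will hang.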

Model and Dataset

We will be using Qwen3-1.7B-Base as the model.

The dataset we will use is MBPP (Mostly Basic Python Problems), translated into Racket. We provide two datasets:

  • mbpp-rkt-test-problems: the evaluation set. Each entry includes a task ID, a problem description, input/output specifications, and a set of test cases (input–output pairs).
  • mbpp-rkt-correct-executions: the training set. Each entry pairs a programming problem (description, input/output specifications, and test cases) with a fully correct reference solution that has been verified to pass all tests. In other words, it provides gold-standard executable programs for training code generation models.

Both datasets are hosted on Hugging Face. You should load them as follows:

import datasets

train_ds = datasets.load_dataset("nuprl/engineering-llm-systems", "mbpp-rkt-correct-executions")
test_ds = datasets.load_dataset("nuprl/engineering-llm-systems", "mbpp-rkt-test-problems")
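When preprocessing the training data, you will need to decide on a text format to feed the model. One plausible format (a sketch only; the field names and exact layout are your design choices, not something the dataset mandates) mirrors the prompt style shown earlier: the problem description rendered as Racket comments, followed by the reference solution:

```python
def format_example(description: str, code: str) -> str:
    """Render one training example as a single Racket source string:
    the description as ;; comments, then the verified solution."""
    header = "\n".join(";; " + line for line in description.splitlines())
    return f"#lang racket\n{header}\n{code}\n"
```

Whatever format you choose, use the same prompt layout at evaluation time; a mismatch between the training format and the evaluation prompts is a common source of poor pass@1.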

The Task

The gist of the task is to perform Supervised Fine-Tuning (SFT) on the training dataset.

Part 1: Evaluation on raw model

First, you should evaluate the untrained (base) model on the test set. Feel free to reuse and modify your code from the HumanEval assignment. You should use top-p=0.95 and temperature=0.2, with 5 completions per prompt.
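A sampling helper with these settings might look like the following sketch (the `max_new_tokens` value and the `pad_token_id` choice are assumptions to tune, not part of the assignment spec):

```python
def generate_completions(model, tokenizer, prompt: str, n: int = 5,
                         max_new_tokens: int = 300) -> list[str]:
    """Sample n completions for a prompt using the assignment's
    evaluation settings: top-p=0.95, temperature=0.2."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens so only the completion is returned.
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
            for seq in outputs]
```

Stripping the prompt tokens before decoding matters here: you want to append only the completion to the original prompt text when building the file you run with racket.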

Part 2: Supervised Fine-Tuning

Next, you should write code that performs Supervised Fine-Tuning (SFT). In addition to what you have seen in class, you should make sure to:

  • Try a few different training parameters, such as the learning rate (between 1e-06 and 1e-03), the number of epochs, or a different learning rate schedule. We ask you to try at least 2 different settings, with at least one of them using multiple epochs. However, do not try too many different settings; we ask you to keep your experiments within 10 hours of GPU time.

  • Log the loss, learning rate, and other metrics of your choice throughout the training process. You are required to have wandb set up for this training.

Note: you are not required to perform training time validation of the model, unlike the training script shown in class.
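One piece of SFT plumbing you will need is a collator that pads each batch and masks the padding in the labels, since HuggingFace models ignore label positions set to -100 when computing the loss. A minimal pure-Python sketch, assuming you tokenize each formatted example into a single list of token IDs:

```python
def pad_batch(token_id_lists, pad_id, label_pad=-100):
    """Pad a batch of token-ID sequences to the batch's max length.
    Labels mirror input_ids, but padding positions are set to -100
    so the cross-entropy loss ignores them."""
    max_len = max(len(toks) for toks in token_id_lists)
    input_ids, attention_mask, labels = [], [], []
    for toks in token_id_lists:
        pad = max_len - len(toks)
        input_ids.append(toks + [pad_id] * pad)
        attention_mask.append([1] * len(toks) + [0] * pad)
        labels.append(toks + [label_pad] * pad)
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels}
```

In your actual training loop you would convert these lists to tensors and pass them to the model; `outputs.loss` then gives you the value to log to wandb each step.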

Do not forget to save the model and tokenizer after training (or at the end of each epoch) so you can use them in Part 3. For example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B-Base")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B-Base")

# Your training loop ...
for epoch in range(num_epochs):
    # Training code ...

    model.save_pretrained(f"sft-model-epoch{epoch}")
    tokenizer.save_pretrained(f"sft-tokenizer-epoch{epoch}")

Part 3: Evaluation

Next, you should evaluate the trained model on the test set using the same evaluation parameters as in Part 1.

You should make sure that you can produce a model that achieves a pass@1 of at least 0.20 on the test set.
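With 5 completions per prompt, pass@1 can be computed per problem with the standard unbiased estimator from the HumanEval paper (for k=1 it reduces to the fraction of correct samples):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, a problem where 1 of 5 completions passes all tests contributes `pass_at_k(5, 1, 1) == 0.2`; your overall pass@1 is the average of this value across all test problems.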

Part 4: Short report

You are to prepare a short report (report.pdf) that summarizes your findings. There is no definite format for the report, but you should include the following information:

  • The link to your wandb project which should include stats for all your runs. Don’t forget to make the project public!
  • How is the dataset preprocessed? What text format are you providing to the model for training and evaluation?
  • What are the different parameters you have tried and what is the performance of the model on the testing set with each of them?
  • Which parameters did you find to be the most effective in improving the model’s performance on the testing set?
  • Is there any discrepancy between the model’s performance on the training dataset (i.e., the loss) and its performance on the test set?

Generative AI Guidance

You are free to use Generative AI to help you with any coding part of the task. However, you are not allowed to use it to generate the report.

Submission

You should submit your code and report to the Gradescope assignment.