Programming Capabilities

In this assignment, we will build an evaluation framework to measure the programming capabilities of an LLM. We will use the classic OpenAI HumanEval benchmark1 to evaluate the recently released SmolLM v2.

There are two parts to the assignment:

  1. Generating solutions with the LLM. SmolLM v2 (360M) is small enough to run on a laptop with 8GB RAM, but it is at least 10x faster to run on our GPU.

  2. Evaluating the solutions by executing Python code. We determined earlier that SmolLM v2 (360M) solves 12% of the HumanEval problems. Your goal is to reproduce this result with your own evaluation code.

You will write these two steps as separate programs: the first generates solutions and saves them to disk, and the second loads the solutions from disk and executes them. This approach is more robust to failures than doing everything in a single run. Moreover, it allows you to run each step on a different machine.

The Dataset

The HumanEval dataset is available on the Hugging Face Hub. We will use a slightly cleaned version of the dataset from the MultiPL-E paper2. You can browse the problems with the dataset viewer and load the dataset in your code as follows:

from datasets import load_dataset

ds = load_dataset("nuprl/engineering-llm-systems", "humaneval", split="test")
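You can then inspect an individual problem. The field names below, prompt and tests, are assumptions based on the MultiPL-E format; confirm them in the dataset viewer before relying on them:

print(len(ds))
print(ds[0]["prompt"])  # the function signature and docstring to complete (assumed field name)
print(ds[0]["tests"])   # the problem's test suite (assumed field name)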

Loading the Model

You can load the model and tokenizer as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")

Note that this does not load the model onto the GPU. If you want to do so, you can use this variation:

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M", device="cuda")

Alternatively, you can do this on a Mac:

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M", device="mps")

Generating Completions

Refer to the code from class to see an example of how to generate text with the model. By default, the .generate() method will only generate 20 tokens.

Task: Write code to generate completions for the HumanEval prompts and save these solutions to disk. Each completion should have up to 300 new tokens, be sampled at temperature 0.2, and you should generate 20 samples per prompt. You can save the data in any format you like, but we recommend a directory of JSON files with one file per prompt.

Recall that the model will not stop generating text when it reaches the end of the function. You should clip the generated text when it produces one of these strings: "\ndef", "\nclass", "\nif", or "\nprint".
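Below is a minimal sketch of the generation program, continuing from the dataset and model loading code above. The prompt field name and the completions/ file layout are assumptions; adapt them to your own setup.

import json
import os
import torch

STOP_STRINGS = ["\ndef", "\nclass", "\nif", "\nprint"]

def clip_completion(text):
    # Truncate at the earliest stop string, if any appears.
    cut = len(text)
    for stop in STOP_STRINGS:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

os.makedirs("completions", exist_ok=True)
for i, problem in enumerate(ds):
    prompt = problem["prompt"]  # assumed field name
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=300,
            do_sample=True,
            temperature=0.2,
            num_return_sequences=20,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    completions = [
        clip_completion(tokenizer.decode(t, skip_special_tokens=True))
        for t in new_tokens
    ]
    with open(f"completions/{i}.json", "w") as f:
        json.dump({"prompt": prompt, "completions": completions}, f)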

Executing Solutions

Task: Write a program that loads the solutions from disk, executes them with their test suites, and prints the mean pass rate (also known as the pass@1 metric). You should run every program with a five-second timeout. You must get a mean pass rate in the range of 10-12%.

You will need to use Python’s subprocess module to run programs with a timeout.

We strongly recommend saving the execution results to disk to help you debug your code.
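Below is a minimal, serial sketch of the execution program. It assumes the one-JSON-file-per-prompt layout and the prompt, completions, and tests field names used above; verify these against your own completion format and the dataset viewer.

import json
import os
import subprocess
import tempfile
from pathlib import Path

from datasets import load_dataset

ds = load_dataset("nuprl/engineering-llm-systems", "humaneval", split="test")

def run_program(program, timeout=5):
    # Write the program to a temporary file and run it in a fresh interpreter.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(["python3", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

results = []
for i, problem in enumerate(ds):
    data = json.loads(Path(f"completions/{i}.json").read_text())
    for completion in data["completions"]:
        # A program is the prompt, the model's completion, and the test suite.
        program = data["prompt"] + completion + "\n" + problem["tests"]  # "tests" is an assumed field name
        results.append(run_program(program))

print(f"Mean pass rate: {sum(results) / len(results):.2%}")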

For the rest of this assignment, feel free to use Generative AI.

Both generating completions and executing programs can take a while. You can speed up completions by using a GPU and batching prompts. You can speed up executions with a ThreadPoolExecutor.
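For example, if the per-program execution lives in a helper like the run_program sketch above, a ThreadPoolExecutor can keep many interpreter processes running at once (programs here is a hypothetical list of the assembled prompt + completion + tests strings):

from concurrent.futures import ThreadPoolExecutor

# Each task spends its time blocked in subprocess.run, so threads are enough
# to saturate the CPUs with child Python processes.
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(run_program, programs))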

We strongly recommend building both programs so that they can resume after interruption and do not needlessly redo work.
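One simple way to make the generation program resumable, assuming the one-JSON-file-per-prompt layout sketched above, is to skip any prompt whose output file already exists:

import os

for i, problem in enumerate(ds):
    out_path = f"completions/{i}.json"
    if os.path.exists(out_path):
        continue  # this prompt was already completed on a previous run
    # ... generate and save as in the sketch above ...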

Submission and Grading

You should submit a ZIP file with the following contents:

  1. completions.py: The program that generates completions.
  2. executions.py: The program that executes completions.
  3. completions/: The directory of JSON files generated by completions.py.

During grading, we will first try to compute the mean pass rate with the completions that you submit. We will run the following command:

python3 executions.py

This should print the mean pass rate on the last line. We will also attempt to generate completions with your code:

rm -rf completions
python3 completions.py
python3 executions.py

The execution environment will have an 80GB GPU, 96 CPUs, and 1TB of RAM.

  1. Mark Chen et al. Evaluating Large Language Models Trained on Code. 2021.

  2. Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, Arjun Guha, Michael Greenberg, Abhinav Jangda. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering (TSE), 2023