Math Word Problems

In this project, we will use an LLM to solve math word problems, such as this one:¹

Very early this morning, Elise left home in a cab headed for the hospital. Fortunately, the roads were clear, and the cab company only charged her a base price of $3, and $4 for every mile she traveled. If Elise paid a total of $23, how far is the hospital from her house?

We will explore several prompting strategies, some of which will be more effective than others at solving math word problems. However, all of the strategies are quite generic and broadly applicable to a wide variety of tasks. We will also use this project to introduce the OpenAI Completions API, which is a widely used API for LLMs.

Prerequisites

You will need to be comfortable with text processing and regular expressions. If you are not, we recommend the following resources:

  1. Chapter 2 of Speech and Language Processing by Dan Jurafsky and James H. Martin is an in-depth introduction to regular expressions.

  2. The Regular Expression HOWTO by A.M. Kuchling is a gentler introduction to regular expressions in Python.
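As a quick taste of how regular expressions will come up in this project, here is one way to pull a numeric answer out of a model's completion. (`extract_answer` is a hypothetical helper; your extraction logic may differ.)

```python
import re

def extract_answer(completion: str):
    # Find every integer or decimal (optionally negative) in the text
    # and return the last one, which is usually the final answer.
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return float(matches[-1]) if matches else None

print(extract_answer("So Elise traveled (23 - 3) / 4 = 5 miles."))  # 5.0
```

Stripping commas first lets the same pattern handle answers like "1,000".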

The Model

The LLM that we will use in this assignment is Meta Llama 3.1 (8B). (We are deliberately not using Llama 3.1 Instruct, the instruction-tuned or “chat” model. The instruction-tuned model is even more capable at math word problems. However, the techniques that we will explore are also useful when working with instruction-tuned models on problems harder than math word problems.) Although it is a relatively small LLM, it is very capable for its size.

On the DeltaAI cluster, the model is available at /scratch/bchk/aguha/models/llama3p1_8b_base.

The Task

Your task is to write a program that solves the math word problems in this dataset. You can load this dataset with the following code:

import datasets

test_data = datasets.load_dataset("nuprl/engineering-llm-systems", "math_word_problems", split="test")

Your program should solve these problems using four different techniques: zero-shot prompting, few-shot prompting,² chain-of-thought prompting,³ and program-aided language models⁴ (described below). For each technique, you should print the accuracy of the model on the test data. For example, you could print:

Zero shot accuracy: 0.10
Few shot accuracy: 0.10
Chain-of-thought accuracy: 0.65
Program-aided language models accuracy: 0.65

It is okay if your program prints some log messages, but don’t overdo it. It shouldn’t be necessary to scroll through pages of text to see the final output.
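Under the hood, each reported number is just the fraction of test problems the technique answered correctly. A minimal sketch, assuming you have already extracted a numeric prediction per problem (with None when extraction failed):

```python
def accuracy(predictions, references):
    # A prediction counts as correct only if extraction succeeded
    # and the value matches the reference answer exactly.
    correct = sum(p is not None and p == r
                  for p, r in zip(predictions, references))
    return correct / len(references)

print(f"Zero shot accuracy: {accuracy([5.0, None, 3.0], [5.0, 7.0, 4.0]):.2f}")
# prints: Zero shot accuracy: 0.33
```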

Call your program math_word_problems.py and have it take the name/path to the model on the command line. That is, you should run the program as follows:

python3 math_word_problems.py MODEL_PATH

When your program runs, it should load the model at MODEL_PATH. We may evaluate it on a different model.
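The first two techniques differ only in the prompt you send to the model. One way the prompts might be assembled is sketched below; the demonstration problems are made up for illustration, and you will want to write your own and experiment with their wording and format:

```python
def zero_shot_prompt(problem: str) -> str:
    # A base model just continues text, so frame the problem
    # so that the natural continuation is the answer.
    return f"Problem: {problem}\nAnswer:"

# Hypothetical worked examples; replace with your own demonstrations.
FEW_SHOT_EXAMPLES = [
    ("A book costs $4. How much do 3 books cost?", "12"),
    ("Sam had 10 apples and ate 2. How many apples are left?", "8"),
]

def few_shot_prompt(problem: str) -> str:
    # Prepend solved examples so the model imitates their format.
    demos = "".join(f"Problem: {q}\nAnswer: {a}\n\n"
                    for q, a in FEW_SHOT_EXAMPLES)
    return demos + zero_shot_prompt(problem)

print(few_shot_prompt("If Elise paid $23 total, how far is the hospital?"))
```

Because the few-shot prompt ends the same way as every demonstration, the model tends to answer in the same short format, which makes extraction easier.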

Program-Aided Language Models

For the final part, we will explore program-aided language models (PAL).⁴ PAL is a variation of chain-of-thought prompting: instead of prompting the model to reason in natural language, we prompt it to produce a program that solves the problem, and then we run that program.

For example, here is a PAL solution to a problem:

num_pizzas = 7
slices_per_pizza = 8
total_slices = num_pizzas * slices_per_pizza   # 56 slices in total
num_people = 1 + 3
slices_per_person = total_slices / num_people  # 14.0 slices each

When we run this program, the value of slices_per_person will be the answer.

Dynamically running code: We strongly recommend encoding each program as a function that returns the answer. If you have a string containing a function definition, you can define the function with the builtin exec and get its result with the builtin eval. For example:

CODE = """
def my_func():
    return 1 + 1
""".strip()

exec(CODE)
x = eval("my_func()")
assert x == 2

# If you know the name of the function, you can 
# call it without eval.
y = my_func()
assert y == 2
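A refinement worth considering: by default, exec defines the function in your module's global namespace, where generated code could shadow your own names. Passing an explicit dictionary keeps the generated definitions contained (note that this isolates names only; it is not a security sandbox):

```python
CODE = """
def my_func():
    return 1 + 1
""".strip()

ns = {}                    # the generated code's own namespace
exec(CODE, ns)             # defines my_func inside ns, not in our globals
result = ns["my_func"]()   # look the function up by name and call it
assert result == 2
```

Wrapping the exec and the call in a try/except is also wise, since model-generated programs sometimes fail to parse or crash at runtime.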

Target Accuracies

You should aim to reach or exceed the following accuracies:

  • Zero shot: 10%
  • Few shot: 10%
  • Chain-of-thought: 60%
  • Program-aided language models: 65%

Generative AI Guidance

You should implement a complete solution that reports zero-shot and few-shot accuracy without using Generative AI. You can then use Generative AI to implement chain-of-thought prompting and program-aided language models.

  1. The word problems in this assignment are from the GSM8K (grade school mathematics) benchmark (Cobbe et al., 2021). 

  2. The effectiveness of few-shot prompting was a key capability that distinguished GPT-3 from GPT-2 (Brown et al., 2020). 

  3. Wei et al., 2022 introduced chain-of-thought prompting. 

  4. Gao et al., 2023 introduced program-aided language models. 