Math Word Problems

In this project, we will use an LLM to solve math word problems, such as this one:[1]

Very early this morning, Elise left home in a cab headed for the hospital. Fortunately, the roads were clear, and the cab company only charged her a base price of $3, and $4 for every mile she traveled. If Elise paid a total of $23, how far is the hospital from her house?

We will explore several prompting strategies, some of which will be more effective than others at solving math word problems. However, all of the strategies are quite generic and broadly applicable to a wide variety of tasks. We will also use this project to introduce the OpenAI Completions API, which is a widely used API for LLMs.

Prerequisites

You will need to be comfortable with text processing and regular expressions. If you are not, we recommend the following resources:

  1. Chapter 2 of Speech and Language Processing by Dan Jurafsky and James H. Martin is an in-depth introduction to regular expressions.

  2. The Regular Expression HOWTO by A.M. Kuchling is a gentler introduction to regular expressions in Python.

The Completions API

The LLM that we will use in this assignment is Meta Llama 3.1 (8B). (We are deliberately not using Llama 3.1 Instruct, which is the instruction-tuned or “chat” model. The instruction-tuned model is even more capable at math word problems. However, the techniques that we will explore are also useful when working with instruction-tuned models to solve problems that are harder than math word problems.) Although it is a relatively small LLM, it is very capable for its size. The following code shows you how to query it:

from openai import OpenAI

# URL and KEY are the base URL and API key for the course-hosted model.
client = OpenAI(base_url=URL, api_key=KEY)

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    temperature=0.2,
    max_tokens=100,
    stop=["\n"],
    prompt="Shakespeare was",
)
print(resp.choices[0].text)

When you run this code, you may see something like this:

born in Stratford-upon-Avon, England, in 1564. His father was a glove-maker and his mother was the daughter of a landowner. Shakespeare attended grammar school in Stratford, and his education was broad and rigorous. He was well versed in literature, history, and the classics. At the age of 18, he married Anne Hathaway, who was eight years his senior. The couple had three children: Susanna, Hamnet, and Judith. Shakespeare began

You should play around with the code above and review the Completions API Reference and the Completions Guide. Note that the model that we host does not support every optional argument that the API documents. However, you can set the stop sequences, set the sampling temperature, set up nucleus sampling, and control the generation length.
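For example, a single request that exercises all four of these options might look like this (the particular values are just illustrative):

resp = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    prompt="The capital of France is",
    temperature=0.2,   # sampling temperature
    top_p=0.95,        # nucleus sampling
    max_tokens=50,     # cap on the generation length
    stop=["\n"],       # stop generating at the first newline
)
print(resp.choices[0].text)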

Zero-Shot Prompting

A base model is not specifically designed to answer questions. All it does is complete the prompt with likely text. For example, if we prompt the model with exactly the text of the math word problem above, we may get an answer, an explanation, or even a continuation of the problem. After five attempts at temperature 0.2, I got the model to produce the following hint instead of an answer:

Hint: You can use the equation 3 + 4x = 23, where x is the distance in miles.

Task 1: Your first task is to figure out how to prompt the model so that it fairly reliably produces an answer for a math word problem: an answer that is always a number. To do so, write a pair of functions: (1) one that takes a math word problem and turns it into a prompt that elicits a direct answer from the LLM, and (2) one that takes the LLM response, which will always be a string, and turns it into a number. The latter function should return None if the LLM does not produce a number as directed.

from typing import List, Optional

def prompt_zero_shot(problem: str) -> str:
    # Your code here
    ...

def extract_zero_shot(completion: str) -> Optional[int]:
    # Your code here
    ...
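To give a flavor of what these functions might look like, here is a minimal sketch; the prompt suffix and the regular expression are just one possible design, and you should experiment with your own:

import re

def prompt_zero_shot(problem: str) -> str:
    # End the prompt where the answer should begin, so the model is
    # likely to continue with the number itself.
    return f"{problem}\nThe answer (a single number) is"

def extract_zero_shot(completion: str) -> Optional[int]:
    # Return the first integer in the completion, if there is one.
    match = re.search(r"-?\d+", completion)
    return int(match.group(0)) if match else None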

The two functions above should not use the Completions API. Instead, put them together using the following code:

def solve_zero_shot(problem: str) -> Optional[int]:
    resp = client.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        temperature=0.2,
        prompt=prompt_zero_shot(problem)
    )
    return extract_zero_shot(resp.choices[0].text)

Task 2: For this task, you will work with a list of math word problems and their answers. For example:

EASY = [
    { 
        "problem": "I ate 2 pears and 1 apple. How many fruit did I eat?",
        "answer": 3 
    },
    {
        "problem": "I had ten chocolates but ate one. How many remain?",
        "answer": 9
    }
]

Your function should take a list of problems, such as the one above, and compute the accuracy of the LLM on that list:

def accuracy_zero_shot(problems: List[dict]) -> float:
    # Your code here
    ...

Your code must not raise an exception, no matter what the LLM returns. So, make sure you handle any exceptions raised by solve_zero_shot.
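A minimal sketch of one possible implementation, assuming the solve_zero_shot function from above:

def accuracy_zero_shot(problems: List[dict]) -> float:
    correct = 0
    for item in problems:
        try:
            answer = solve_zero_shot(item["problem"])
        except Exception:
            # A network error or malformed completion counts as a wrong
            # answer rather than crashing the evaluation.
            answer = None
        if answer == item["answer"]:
            correct += 1
    return correct / len(problems)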

Task 3: The dataset nuprl/llm-systems-math-word-problems has 50 math word problems in its test set, and you can load it as follows:

import datasets

TEST = datasets.load_dataset("nuprl/llm-systems-math-word-problems", split="test")

print(accuracy_zero_shot(TEST))

What accuracy do you get? Try re-running accuracy_zero_shot a few times. If you are sampling with temperature, you will see that the result can vary significantly from run to run. To get a stable result, update accuracy_zero_shot to try each problem n=5 times and report the mean accuracy across attempts.
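One simple way to do this, equivalent in expectation, is to run the whole evaluation n times and average the results; the helper name below is our own:

def mean_accuracy_zero_shot(problems: List[dict], n: int = 5) -> float:
    # Each run samples fresh completions, so averaging over n runs
    # smooths out the randomness of sampling with temperature.
    return sum(accuracy_zero_shot(problems) for _ in range(n)) / n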

For full credit, you need to get at least 10% accuracy and a stable result. If you fall short, you can try to improve accuracy in a few ways:

  1. You can try to improve the prompt in prompt_zero_shot. However, your prompt must still elicit a direct answer and should not give examples. (In the next part, we will use more sophisticated prompting techniques to elicit more complex responses.)

  2. You may find that extract_zero_shot fails when the LLM produces strings such as "$23" or "1,200". Feel free to address these; see the sketch after this list.

  3. Experiment with different generation hyperparameters, such as temperature.
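For the second item, here is a hedged sketch of a more tolerant extraction helper; the helper name is ours, and the normalization shown handles only the two cases mentioned above:

import re

def parse_answer(completion: str) -> Optional[int]:
    # Strip dollar signs and thousands separators so that strings
    # like "$23" and "1,200" parse as 23 and 1200.
    cleaned = completion.replace("$", "").replace(",", "")
    match = re.search(r"-?\d+", cleaned)
    return int(match.group(0)) if match else None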

Tracking Progress

When you have a long list of problems, you will find it helpful to track progress. You could print after each problem, but that will fill up your screen quickly. Alternatively, use the tqdm library to display a compact progress bar.
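For example, wrapping the loop over problems in tqdm displays a live progress bar:

from tqdm import tqdm

for item in tqdm(problems):
    ...  # solve and score each problem as before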

Few-Shot Prompting

With a zero-shot prompt, we give the model very limited guidance on what kind of answer we want. In fact, your zero-shot prompt was unlikely to be 100% reliable: there were probably a few problems where it did not produce a number. To address this, we’ll now explore few-shot prompting.[2]

Task 4: Implement the following functions:

def prompt_few_shot(problem: str) -> str:
    # Your code here
    ...

def extract_few_shot(completion: str) -> Optional[int]:
    # Your code here
    ...

def solve_few_shot(problem: str) -> Optional[int]:
    # Your code here
    ...

def accuracy_few_shot(problems: List[dict]) -> float:
    # Your code here
    ...

The prompt that you construct should contain a few example problems and answers. With your few-shot prompt, you should nearly always get a numeric answer. However, we do not expect the accuracy to increase by very much over the zero-shot version.
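As a starting point, a few-shot prompt might be structured like this; the exact wording, delimiters, and examples are up to you (the examples below reuse the EASY problems from earlier):

FEW_SHOT_PREFIX = """\
Problem: I ate 2 pears and 1 apple. How many fruit did I eat?
Answer: 3

Problem: I had ten chocolates but ate one. How many remain?
Answer: 9

"""

def prompt_few_shot(problem: str) -> str:
    # The examples establish a Problem/Answer pattern that the model
    # is likely to continue with "Answer: <number>".
    return f"{FEW_SHOT_PREFIX}Problem: {problem}\nAnswer:"

Passing stop=["\n"] in the completion call then truncates the response right after the answer line.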

Chain-of-Thought Prompting

For the rest of this assignment, feel free to use Copilot or other kinds of generative AI.

We’ll now explore chain-of-thought (COT) prompting, and see that it significantly increases accuracy on this task.[3] For example, consider the following problem:

Henry and 3 of his friends order 7 pizzas for lunch. Each pizza is cut into 8 slices. If Henry and his friends want to share the pizzas equally, how many slices can each of them have?

Here is one way to reason through the answer:

7 pizzas are cut into 8 slices each. Thus the total number of slices is 7 * 8 = 56. Henry and 3 friends want to share the pizza equally, so the slices are divided between 4 people. Each person gets 56 / 4 = 14 slices.

Task 5: Your task is to implement the following functions:

def prompt_cot(problem: str) -> str:
    # Your code here
    ...

def extract_cot(completion: str) -> Optional[int]:
    # Your code here
    ...

def solve_cot(problem: str) -> Optional[int]:
    # Your code here
    ...

def accuracy_cot(problems: List[dict]) -> float:
    # Your code here
    ...

Start by writing the prompting function, which should prefix the problem with 2-3 COT examples. You’ll need to write the “thoughts” yourself. Do not use problems from the test split for the COT examples. Instead, you may use problems from the train split, or construct your own.

Given that your prompt elicits “thoughts” from the model, you will need to carefully extract the final answer in extract_cot. This will require more work than the earlier approaches. For full credit, you need to get at least 45% accuracy and a stable result.
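Here is a hedged sketch of one possible design. It ends every example with a line of the form “The answer is N.”, which extract_cot then searches for; both that convention and the example thoughts are our own:

import re

COT_PREFIX = """\
Problem: I ate 2 pears and 1 apple. How many fruit did I eat?
Reasoning: 2 pears plus 1 apple is 2 + 1 = 3 pieces of fruit.
The answer is 3.

Problem: I had ten chocolates but ate one. How many remain?
Reasoning: I started with 10 chocolates and ate 1, so 10 - 1 = 9 remain.
The answer is 9.

"""

def prompt_cot(problem: str) -> str:
    return f"{COT_PREFIX}Problem: {problem}\nReasoning:"

def extract_cot(completion: str) -> Optional[int]:
    # The model may reason for several lines before stating the answer,
    # so look for the first "The answer is N" in the completion.
    match = re.search(r"The answer is (-?\d+)", completion)
    return int(match.group(1)) if match else None

In solve_cot, a stop sequence such as "\nProblem:" can keep the model from inventing and answering follow-up problems of its own.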

Program-Aided Language Models

For the final part, we will explore program-aided language models (PAL).[4] PAL is a variation of chain-of-thought: instead of prompting the model to reason in natural language, we prompt it to produce a program that solves the problem, and then we run that program.

For example, consider the example problem from the COT section. Instead of producing natural language, the model could produce the following program:

num_pizzas = 7
slices_per_pizza = 8
total_slices = num_pizzas * slices_per_pizza
num_people = 1 + 3
slices_per_person = total_slices / num_people

When we run this program, the value of slices_per_person will be the answer. (Note that / produces a float, so the value will be 14.0; your extraction code may need to convert it to an integer.)

Dynamically running code: We strongly recommend encoding each program as a function that returns the answer. If you have a string that defines a function, you can define the function with the built-in exec and get its result with the built-in eval. For example:

CODE = """
def my_func():
    return 1 + 1
""".strip()

exec(CODE)
x = eval("my_func()")
assert x == 2

# If you know the name of the function, you can 
# call it without eval.
y = my_func()
assert y == 2
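
One caveat: calling exec at the top level, as above, defines my_func in the current global namespace. To keep model-generated definitions contained, you can instead pass an explicit namespace dictionary; a sketch:

env: dict = {}
exec(CODE, env)       # defines my_func inside env instead of our globals
z = env["my_func"]()  # look the function up in env and call it
assert z == 2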

Task 6: Your task is to implement PAL with the following functions:

def prompt_pal(problem: str) -> str:
    # Your code here
    ...

def extract_pal(completion: str) -> Optional[int]:
    # Your code here. Use exec and eval.
    ...

def solve_pal(problem: str) -> Optional[int]:
    # Your code here
    ...

def accuracy_pal(problems: List[dict]) -> float:
    # Your code here
    ...

For full credit, you need to get at least 60% accuracy and a stable result.
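To make the exec/eval advice concrete, here is a hedged sketch of extract_pal. It assumes your prompt instructs the model to wrap its program in a function named solution; that name and the overall structure are our choice, not a requirement of PAL:

def extract_pal(completion: str) -> Optional[int]:
    env: dict = {}
    try:
        exec(completion, env)       # define solution() from the model's code
        result = env["solution"]()  # run it to compute the answer
        return int(result)          # answers in this dataset are integers
    except Exception:
        # Model-generated code can fail in arbitrary ways; treat any
        # failure as "no answer" rather than crashing the evaluation.
        return None

Be aware that executing model-generated code is risky in general. It is acceptable for this assignment, but sandboxing is standard practice in real systems.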

What to Submit

You should submit two files. First, a Python file that implements all the functions above. Running this file should have no side-effects. Second, a Jupyter notebook that shows how you tested the work. This notebook should load the datasets and use the functions implemented in the first file. Make sure you save the cells’ outputs.

  1. The word problems in this assignment are from the GSM8K (grade school mathematics) benchmark (Cobbe et al., 2021). 

  2. The effectiveness of few-shot prompting was a key capability that distinguished GPT3 from GPT2 (Brown et al., 2020). 

  3. Wei et al., 2022 introduced chain-of-thought prompting. 

  4. Gao et al., 2023 introduced program-aided language models.