Writing Prompts for Code LLMs

This is the application that we used:

https://github.com/arjunguha/charlie-the-coding-cow-classroom

You can run it yourself using the class API keys. If you’re curious about the original study, have a look at the research paper:

Sydney Nguyen, Hannah Babe, Yangtian Zi, Arjun Guha, Carolyn Jane Anderson, and Molly Q Feldman. How Beginning Programmers and Code LLMs (Mis)read Each Other. CHI 2024.

More Prompting Techniques

Setup

from openai import OpenAI
import datasets
import textwrap
from collections import namedtuple
from tqdm.auto import tqdm
import os

I have set the following environment variables in ~/.zshrc (macOS) or ~/.bashrc (Linux / WSL).
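
If you prefer not to edit your shell configuration, you can also set them in-process before creating the client. The values below are placeholders; use the real class endpoint and key.

os.environ["CS4973_BASE_URL"] = "..."  # placeholder for the class endpoint
os.environ["CS4973_API_KEY"] = "..."   # placeholder for your API key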

BASE_URL = os.getenv("CS4973_BASE_URL")
API_KEY = os.getenv("CS4973_API_KEY")
assert BASE_URL is not None
assert API_KEY is not None

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

def llama3(prompt, **kwargs):
    response = client.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        prompt=prompt,
        **kwargs)
    return response.choices[0].text
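
As a quick sanity check, we can call the helper directly. (The prompt here is just an illustrative example; the completion will vary with the model and sampling settings.)

print(llama3("The capital of France is", temperature=0, max_tokens=5))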

Loading Datasets

You should get used to the Hugging Face Datasets library. It is widely used for public benchmark problems and for proprietary datasets. We are going to use it to load some problems from BIG-Bench Hard (Suzgun et al., 2022).

The code below loads the "maveriq/bigbenchhard" dataset, selecting the "reasoning_about_colored_objects" configuration and the "train" split within it.

bbh = datasets.load_dataset("maveriq/bigbenchhard", "reasoning_about_colored_objects", split="train")
bbh
Dataset({
    features: ['input', 'target'],
    num_rows: 250
})

Let’s look at one of the problems.

print(bbh[0]["input"])
On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, three grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?
Options:
(A) zero
(B) one
(C) two
(D) three
(E) four
(F) five
(G) six
(H) seven
(I) eight
(J) nine
(K) ten
(L) eleven
(M) twelve
(N) thirteen
(O) fourteen
(P) fifteen
(Q) sixteen

The following function makes each item a little easier to read.

def inspect_bbh(item):
    txt, options = item["input"].split("Options:", maxsplit=1)
    txt = textwrap.fill(txt, width=80)
    for opt in options.split("\n"):
        if item["target"] in opt:
            txt += f"\n\nAnswer: {opt}"
            break
    return txt

print(inspect_bbh(bbh[100]))
On the desk, you see several things arranged in a row: a burgundy bracelet, a
grey mug, a green necklace, and a magenta textbook. What is the color of the
thing directly to the left of the necklace?

Answer: (P) grey

Zero-Shot Prompting

The BBH problems are quite hard. Llama3.1-8B-Base doesn’t do very well with zero-shot prompting. We’ll try a single problem first, and then write an evaluation loop that saves the failures. The approach is to write these functions: prompt_zero_shot, extract_zero_shot, solve_zero_shot, and accuracy_zero_shot.

bbh[0]
{'input': 'On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, three grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?\nOptions:\n(A) zero\n(B) one\n(C) two\n(D) three\n(E) four\n(F) five\n(G) six\n(H) seven\n(I) eight\n(J) nine\n(K) ten\n(L) eleven\n(M) twelve\n(N) thirteen\n(O) fourteen\n(P) fifteen\n(Q) sixteen',
 'target': '(D)'}

def prompt_zero_shot(item):
    return item["input"] + "\n\nCorrect option:"

def extract_zero_shot(response):
    # The completion should be just the option label, e.g. " (D)".
    return response.strip()


llama3(prompt_zero_shot(bbh[40]), temperature=0, max_tokens=3)
' (B)\n'
def solve_zero_shot(item):
    response = extract_zero_shot(llama3(prompt_zero_shot(item), temperature=0, max_tokens=3))
    if item["target"] == response:
        return True
    else:
        print(f"Expected {item['target']} got {response}")
        return False

solve_zero_shot(bbh[0])
Expected (D) got (A)
False
def accuracy_zero_shot(items):
    num_correct = 0
    failures = [ ]
    for item in tqdm(items):
        result = solve_zero_shot(item)
        if result:
            num_correct += 1
        else:
            failures.append(item)

    return (num_correct / len(items), failures)


accuracy, failures = accuracy_zero_shot(bbh)
  0%|          | 0/250 [00:00<?, ?it/s]

  ... truncated output ...
accuracy
0.204
from collections import Counter

Counter([ item["target"] for item in failures ])
Counter({'(B)': 33,
         '(A)': 28,
         '(D)': 27,
         '(E)': 20,
         '(F)': 13,
         '(C)': 13,
         '(R)': 10,
         '(I)': 8,
         '(O)': 7,
         '(M)': 6,
         '(G)': 6,
         '(L)': 5,
         '(J)': 5,
         '(Q)': 5,
         '(K)': 4,
         '(H)': 3,
         '(P)': 3,
         '(N)': 3})

Let’s look at a few of these wrong answers and think through what the right answers should be.
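
One way to do that, reusing inspect_bbh and the failures list from the evaluation loop above (inspect_bbh prints the correct answer along with each problem):

# Print the first few zero-shot failures in a readable format.
for item in failures[:3]:
    print(inspect_bbh(item))
    print("***")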

Few-Shot Prompting

Let’s try few-shot prompting. I haven’t tried this before class, and I don’t think it will be very effective. We’ll write the same four functions as above. (Use GenAI.)

FEW_SHOT_PROMPT = "".join(
    item["input"] + "\n\nCorrect option: " + item["target"] + "\n\n"
    for item in [bbh[0], bbh[1], bbh[2]]
)

def prompt_few_shot(item):
    return FEW_SHOT_PROMPT + "\n\n" + item["input"] + "\n\nCorrect option:"

def extract_few_shot(response):
    return response.strip()

def solve_few_shot(item):
    response = extract_few_shot(llama3(prompt_few_shot(item), temperature=0, max_tokens=3))
    if item["target"] == response:
        return True
    else:
        print(f"Expected {item['target']} got {response}")
        return False

solve_few_shot(bbh[2])
True
def accuracy_few_shot(items):
    num_correct = 0
    failures = [ ]
    for item in tqdm(items):
        result = solve_few_shot(item)
        if result:
            num_correct += 1
        else:
            failures.append(item)

    return (num_correct / len(items), failures)


accuracy, failures = accuracy_few_shot(bbh)
  0%|          | 0/250 [00:00<?, ?it/s]


Expected (D) got (F)
Expected (B) got (I)
Expected (I) got (M)
Expected (D) got (A)
Expected (C) got (D)


... truncated output ...
accuracy
0.472

Chain-of-Thought Prompting

In chain-of-thought prompting, we construct a few-shot prompt, where the few-shot examples include an example of how one might reason through the problem. We do so below, using the reasoning steps from above. Notice how we format the prompt to include both the reasoning steps and an answer that we can extract.

COT_PROMPT = """
Input: On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, three grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?
Options: 
(A) zero
(B) one
(C) two
(D) three
(E) four
(F) five
(G) six
(H) seven
(I) eight
(J) nine
(K) ten
(L) eleven
(M) twelve
(N) thirteen
(O) fourteen
(P) fifteen
(Q) sixteen

Reasoning: There are three grey notebooks and three grey cat toys, which is six grey objects. There are two mauve notebooks and three grey notebooks. If I remove all the notebooks from the floor, I remove three grey objects, which gives me three grey objects that remain.

Answer: (D) three

Done

Input: On the desk, you see a set of things arranged in a row: a grey cup, a purple mug, and a blue teddy bear. What is the color of the thing directly to the right of the cup?
Options:
(A) red
(B) orange
(C) yellow
(D) green
(E) blue
(F) brown
(G) magenta
(H) fuchsia
(I) mauve
(J) teal
(K) turquoise
(L) burgundy
(M) silver
(N) gold
(O) black
(P) grey
(Q) purple
(R) pink

Reasoning: The purple mug is directly to the right of the cup. The list is arranged from left to right.

Answer: (Q) purple

Done

Input: On the nightstand, you see a set of items arranged in a row: a gold plate, a silver stress ball, a fuchsia notebook, a mauve bracelet, a green jug, and a yellow fidget spinner. What is the color of the item directly to the left of the jug?
Options:
(A) red
(B) orange
(C) yellow
(D) green
(E) blue
(F) brown
(G) magenta
(H) fuchsia
(I) mauve
(J) teal
(K) turquoise
(L) burgundy
(M) silver
(N) gold
(O) black
(P) grey
(Q) purple
(R) pink

Reasoning: The list is arranged from left to right. The mauve bracelet appears immediately before the green jug.

Answer: (I) mauve

Done""".strip()


def prompt_cot(item):
    return COT_PROMPT + "\n\nInput: " + item["input"] + "\n\nReasoning: "

def extract_cot(response: str):
    # Pull out the option label that follows "Answer: "; return a sentinel
    # that matches no target if the model never produced an answer.
    items = response.split("Answer: ")
    if len(items) < 2:
        return "(Z)"
    return items[1].split(" ")[0]

print(llama3(prompt_cot(bbh[3]), temperature=0, max_tokens=200, stop=["Done"]))

 The list is arranged from left to right. The red jug is the second item in the list. The fuchsia teddy bear is the first item in the list. The gold puzzle is the third item in the list. The burgundy bracelet is the fourth item in the list. The green notebook is the fifth item in the list. There are four non-magenta items to the right of the red item.

Answer: (E) four
def solve_cot(item):
    raw_response = llama3(prompt_cot(item), temperature=0, max_tokens=150, stop=["Done"])
    response = extract_cot(raw_response)
    if item["target"] == response:
        return True, raw_response
    else:
        print(f"Expected {item['target']} got {response}")
        return False, raw_response
    

def accuracy_cot(items):
    num_correct = 0
    failures = [ ]
    for item in tqdm(items):
        result, thought = solve_cot(item)
        if result:
            num_correct += 1
        else:
            failures.append({ "thought": thought, **item })

    return (num_correct / len(items), failures)


accuracy, failures = accuracy_cot(bbh)
  0%|          | 0/250 [00:00<?, ?it/s]


Expected (D) got (E)
Expected (L) got (E)
Expected (D) got (E)
Expected (E) got (D)

... truncated output ...

This got 60% accuracy. (I overwrote the cell by accident.)

print(failures[10]["input"])
print("***")
print(failures[10]["thought"])
On the table, you see three green bracelets, one teal dog leash, one green dog leash, and three green paperclips. If I remove all the teal items from the table, how many paperclips remain on it?
Options:
(A) zero
(B) one
(C) two
(D) three
(E) four
(F) five
(G) six
(H) seven
(I) eight
(J) nine
(K) ten
(L) eleven
(M) twelve
(N) thirteen
(O) fourteen
(P) fifteen
(Q) sixteen
***
3 green bracelets, 1 teal dog leash, 1 green dog leash, and 3 green paperclips. If I remove all the teal items from the table, I remove the teal dog leash. There are 3 green paperclips that remain.

Answer: (C) two

NOTE: There was no need to use “Done” as the stop token. We could have instead crafted a prompt without the “Done” markers and used “Input:” as the stop token, since every new problem begins with “Input:”.
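
Here is a minimal sketch of that variant, reusing COT_PROMPT, llama3, and extract_cot from above. The names prompt_cot_alt and solve_cot_alt are hypothetical, and the sketch assumes every “Done” marker in COT_PROMPT is preceded by a blank line, as written above.

# Hypothetical variant: drop the "Done" markers from the few-shot prompt and
# stop generation when the model starts a new "Input:" block instead.
COT_PROMPT_ALT = COT_PROMPT.replace("\n\nDone", "")

def prompt_cot_alt(item):
    return COT_PROMPT_ALT + "\n\nInput: " + item["input"] + "\n\nReasoning: "

def solve_cot_alt(item):
    raw_response = llama3(prompt_cot_alt(item), temperature=0, max_tokens=150, stop=["Input:"])
    return extract_cot(raw_response)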