More Prompting Techniques

Setup

from openai import OpenAI
import datasets
import textwrap
from collections import namedtuple
from tqdm.auto import tqdm
import os

I have set the following environment variables in ~/.zshrc (macOS) or ~/.bashrc (Linux / WSL).

BASE_URL = os.getenv("CS4973_BASE_URL")
API_KEY = os.getenv("CS4973_API_KEY")
assert BASE_URL is not None
assert API_KEY is not None

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

def llama3(prompt, **kwargs):
    response = client.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        prompt=prompt,
        **kwargs)
    return response.choices[0].text

Loading Datasets

You should get used to the Hugging Face Datasets library. It is widely used both for public benchmarks and for proprietary datasets. We are going to use it to load some problems from BIG-Bench Hard (Suzgun et al., 2022).

The code below loads the "maveriq/bigbenchhard" dataset, selecting the "reasoning_about_colored_objects" configuration and the "train" split within it.

bbh = datasets.load_dataset("maveriq/bigbenchhard", "reasoning_about_colored_objects", split="train")
bbh
Dataset({
    features: ['input', 'target'],
    num_rows: 250
})

Let’s look at one of the problems.

The following function makes each item a little easier to read.

def inspect_bbh(item):
    txt, options = item["input"].split("Options:", maxsplit=1)
    txt = textwrap.fill(txt, width=80)
    for opt in options.split("\n"):
        if item["target"] in opt:
            txt += f"\n\nAnswer: {opt}"
            break

    if "thought" in item:
        txt += "\nThought: " + textwrap.fill(item["thought"], width=80)
    return txt

print(inspect_bbh(bbh[100]))
On the desk, you see several things arranged in a row: a burgundy bracelet, a
grey mug, a green necklace, and a magenta textbook. What is the color of the
thing directly to the left of the necklace?

Answer: (P) grey

The BBH problems are quite hard. Llama 3.1-8B-Base doesn't do very well with zero-shot prompting. We'll try once, and also write an evaluation loop that saves the failures in zero_shot_failures. The approach is to write four functions: prompt_zero_shot, extract_zero_shot, solve_zero_shot, and accuracy_zero_shot.

bbh[0]
{'input': 'On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, three grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?\nOptions:\n(A) zero\n(B) one\n(C) two\n(D) three\n(E) four\n(F) five\n(G) six\n(H) seven\n(I) eight\n(J) nine\n(K) ten\n(L) eleven\n(M) twelve\n(N) thirteen\n(O) fourteen\n(P) fifteen\n(Q) sixteen',
 'target': '(D)'}
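One way those four functions might look is sketched below. The exact prompt wording, the max_tokens value, and the "(Z)" sentinel for unparseable responses are my choices, not fixed by the exercise; the sketch reuses the llama3 helper defined in the Setup section.

```python
def prompt_zero_shot(item):
    # Ask for the answer directly, with no worked examples.
    return item["input"] + "\nAnswer:"

def extract_zero_shot(response: str):
    # The model's answer should start with an option letter such as "(D)".
    response = response.strip()
    if response.startswith("("):
        return response.split()[0]
    return "(Z)"  # sentinel: no parseable answer

def solve_zero_shot(item):
    response = llama3(prompt_zero_shot(item), temperature=0, max_tokens=10)
    return extract_zero_shot(response) == item["target"]

def accuracy_zero_shot(items):
    num_correct = 0
    zero_shot_failures = []
    for item in items:
        if solve_zero_shot(item):
            num_correct += 1
        else:
            zero_shot_failures.append(item)
    return num_correct / len(items), zero_shot_failures
```

Since the base model may ramble after the option letter, extract_zero_shot keeps only the first whitespace-delimited token of the response.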

Chain-of-thought Prompting

In chain-of-thought prompting, we construct a few-shot prompt, where the few-shot examples include an example of how one might reason through the problem. We do so below, using the reasoning steps from above. Notice how we format the prompt to include both the reasoning steps and an answer that we can extract.

COT_PROMPT = """
Input: On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, four grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the notebooks from the floor, how many grey objects remain on it?
Options: 
(A) zero
(B) one
(C) two
(D) three
(E) four
(F) five
(G) six
(H) seven
(I) eight
(J) nine
(K) ten
(L) eleven
(M) twelve
(N) thirteen
(O) fourteen
(P) fifteen
(Q) sixteen

Reasoning: There are four grey notebooks and three grey cat toys, which is seven grey objects. There are two mauve notebooks and four grey notebooks. If I remove all the notebooks from the floor, I remove four grey notebooks, which gives me seven minus four = three grey cat toys on the floor.

Answer: (D) three

Input: On the desk, you see a set of things arranged in a row: a grey cup, a purple mug, and a blue teddy bear. What is the color of the thing directly to the right of the cup?
Options:
(A) red
(B) orange
(C) yellow
(D) green
(E) blue
(F) brown
(G) magenta
(H) fuchsia
(I) mauve
(J) teal
(K) turquoise
(L) burgundy
(M) silver
(N) gold
(O) black
(P) grey
(Q) purple
(R) pink

Reasoning: The purple mug is directly to the right of the cup. The list is arranged from left to right.

Answer: (Q) purple

Input: On the nightstand, you see a set of items arranged in a row: a gold plate, a silver stress ball, a fuchsia notebook, a mauve bracelet, a green jug, and a yellow fidget spinner. What is the color of the item directly to the left of the jug?
Options:
(A) red
(B) orange
(C) yellow
(D) green
(E) blue
(F) brown
(G) magenta
(H) fuchsia
(I) mauve
(J) teal
(K) turquoise
(L) burgundy
(M) silver
(N) gold
(O) black
(P) grey
(Q) purple
(R) pink

Reasoning: The list is arranged from left to right. The mauve bracelet appears immediately before the green jug.

Answer: (I) mauve

Input: On the floor, there is one mauve cat toy, two purple cat toys, three grey cat toys, two mauve notebooks, four grey notebooks, three burgundy cat toys, and one purple notebook. If I remove all the mugs from the floor, how many grey objects remain on it?
Options: 
(A) zero
(B) one
(C) two
(D) three
(E) four
(F) five
(G) six
(H) seven
(I) eight
(J) nine
(K) ten
(L) eleven
(M) twelve
(N) thirteen
(O) fourteen
(P) fifteen
(Q) sixteen

Reasoning: There are four grey notebooks and three grey cat toys, which is seven grey objects. There are zero mugs on the floor. Thus removing mugs leaves four plus three = seven grey objects.

Answer: (H) seven
""".strip()


def prompt_cot(item):
    return COT_PROMPT + "\n\nInput: " + item["input"].strip() + "\n\nReasoning:"

def extract_cot(response: str):
    items = response.split("Answer: ")
    if len(items) < 2:
        return "(Z)"  # no "Answer:" marker in the response
    # The option letter, e.g. "(D)", is the first token after "Answer: ".
    answer = items[1].split()
    return answer[0] if answer else "(Z)"

print(llama3(prompt_cot(bbh[3]), temperature=0, max_tokens=200, stop=["Input:"]))

 The list is arranged from left to right. The red jug is the second item in the list. The red jug is followed by a gold puzzle, a burgundy bracelet, and a green notebook. There are four items to the right of the red jug.

Answer: (E) four
llama3(prompt_cot(bbh[30]), temperature=0, max_tokens=200, stop=["\nInput:"])
' The list is arranged from left to right. The mauve sheet of paper is the left-most item.\n\nAnswer: (I) mauve\n'
def solve_cot(item):
    raw_response = llama3(prompt_cot(item), temperature=0, max_tokens=150, stop=["Input:"])
    response = extract_cot(raw_response)
    if item["target"] == response:
        return True, raw_response
    else:
        print(f"Expected {item['target']} got {response}")
        return False, raw_response
    

def accuracy_cot(items):
    num_correct = 0
    failures = [ ]
    for item in tqdm(items):
        result, thought = solve_cot(item)
        if result:
            num_correct += 1
        else:
            failures.append({ "thought": thought, **item })

    return (num_correct / len(items), failures)
bbh_mini = bbh.shuffle().select(range(20))

accuracy, failures = accuracy_cot(bbh_mini)


Expected (A) got (B)
Expected (B) got (P)
Expected (E) got (A)
Expected (L) got (M)
Expected (C) got (D)
Expected (A) got (I)

accuracy1, failures1 = accuracy_cot(bbh_mini)


Expected (C) got (D)
Expected (B) got (P)
Expected (E) got (C)
Expected (L) got (I)
Expected (C) got (D)
Expected (A) got (I)
print(inspect_bbh(failures[0]))
On the floor, there is one magenta scrunchiephone charger and three grey
pencils. If I remove all the pencils from the floor, how many burgundy items
remain on it?

Answer: (A) zero
Thought:  There are three grey pencils and one magenta scrunchiephone charger. If I
remove all the pencils from the floor, I remove three grey pencils, which leaves
me with one magenta scrunchiephone charger.  Answer: (B) one
print(inspect_bbh(failures1[5]))
On the nightstand, you see several objects arranged in a row: a blue pencil, a
red keychain, a black teddy bear, a brown necklace, a magenta mug, and a mauve
cat toy. What is the color of the object directly to the right of the pencil?

Answer: (A) red
Thought:  The list is arranged from left to right. The mauve cat toy appears immediately
after the blue pencil.  Answer: (I) mauve

accuracy2, failures2 = accuracy_cot(bbh)
accuracy2


Expected (D) got (E)
Expected (B) got (I)
Expected (F) got (G)
Expected (L) got (E)
Expected (C) got (Z)
Expected (D) got (E)
Expected (E) got (G)
Expected (D) got (E)
Expected (D) got (E)
Expected (K) got (L)
Expected (B) got (A)
Expected (E) got (I)
Expected (B) got (E)
Expected (B) got (A)
Expected (A) got (K)
Expected (C) got (D)
Expected (F) got (E)
Expected (D) got (P)
Expected (D) got (B)
Expected (M) got (R)
Expected (B) got (Z)
Expected (L) got (M)
Expected (E) got (D)
Expected (D) got (K)
Expected (F) got (B)
Expected (H) got (E)
Expected (A) got (I)
Expected (B) got (C)
Expected (A) got (E)
Expected (D) got (J)
Expected (E) got (A)
Expected (D) got (C)
Expected (A) got (D)
Expected (O) got (K)
Expected (G) got (F)
Expected (A) got (C)
Expected (A) got (G)
Expected (B) got (C)
Expected (C) got (E)
Expected (B) got (R)
Expected (G) got (K)
Expected (B) got (A)
Expected (C) got (B)
Expected (A) got (D)
Expected (D) got (C)
Expected (A) got (K)
Expected (I) got (K)
Expected (F) got (B)
Expected (D) got (E)
Expected (G) got (D)
Expected (B) got (A)
Expected (D) got (C)
Expected (A) got (G)
Expected (F) got (I)
Expected (A) got (C)
Expected (R) got (Q)
Expected (F) got (P)
Expected (B) got (E)
Expected (D) got (E)
Expected (A) got (C)
Expected (R) got (J)
Expected (P) got (K)
Expected (G) got (C)
Expected (L) got (D)
Expected (L) got (I)
Expected (B) got (D)
Expected (B) got (C)
Expected (C) got (Z)
Expected (D) got (C)
Expected (C) got (D)
Expected (B) got (A)
Expected (F) got (E)
Expected (G) got (D)
Expected (F) got (Z)
Expected (J) got (Q)
Expected (C) got (E)
Expected (F) got (D)
Expected (R) got (K)
Expected (D) got (E)
Expected (A) got (I)
Expected (E) got (D)
Expected (B) got (C)
Expected (G) got (P)
Expected (O) got (I)
Expected (C) got (N)
Expected (N) got (L)
Expected (B) got (C)
Expected (D) got (G)
Expected (E) got (C)
Expected (A) got (K)
Expected (H) got (C)
Expected (F) got (C)
Expected (N) got (M)
Expected (A) got (D)
Expected (B) got (C)
Expected (F) got (E)
Expected (O) got (R)
Expected (B) got (C)
Expected (I) got (G)
Expected (A) got (E)
Expected (A) got (C)
Expected (B) got (A)
Expected (B) got (C)
Expected (F) got (E)
Expected (F) got (E)
Expected (Q) got (H)
Expected (A) got (C)
Expected (A) got (C)
Expected (B) got (P)
Expected (E) got (Z)
Expected (E) got (I)





0.556
cot_failures = [ ]
cot_successes = [ ]
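One way to populate these two lists is to run solve_cot over the problems again and sort each one into the matching bucket. The helper name split_by_success below is my own; the sketch keeps the model's reasoning alongside each problem, the same way accuracy_cot records its failures.

```python
def split_by_success(items, solve):
    # Run `solve` on every item and bucket the results, keeping the
    # model's reasoning alongside each problem for later inspection.
    successes, failures = [], []
    for item in items:
        ok, thought = solve(item)
        bucket = successes if ok else failures
        bucket.append({"thought": thought, **item})
    return successes, failures
```

With this, cot_successes, cot_failures = split_by_success(bbh_mini, solve_cot) fills both lists in one pass.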