Class Introduction, Tokenization and Basic Prompting

Introduction

  1. This is not a machine learning course. This is also not a course about automated software engineering. The course is about building natural interfaces to software systems.

  2. What do we mean by a natural language interface?

    These systems are powered by LLMs, and many more have appeared over the past year.

  3. We will focus on textual, natural language interfaces, but we will look at other modalities too.
  4. In the course, we will largely use LLMs via web-based APIs. These APIs typically hide a lot of details that aren’t strictly necessary to understand.
  5. This is not a machine learning course, but it’s still important to understand a little bit about what’s going on behind the scenes:
    • When things go wrong or you’re working on a challenging problem, these details matter.
    • To get a job doing this work, you need to demonstrate that you know the details. Jobs in this area are very hard to get, so you need to be exceptional: understanding what’s happening behind the scenes is a way to stand out.
  6. Policies:
  7. Homework:
    • Done in pairs (your choice of teammate)
    • Assume you cannot submit HW late. The next HW may use a different LLM, so the old LLM may not be available.
    • Do not do HW at the last minute – the LLM server may be down/overloaded.
    • HW is due at midnight. I wake up at 5am. So, if the LLM server dies at 10pm, you’re out of luck.
  8. Grading:
    • Code is graded on quality (subjective) and correctness (objective)
    • Exact grade breakdown TBD

Tokenization with GPT-2

# We need to use PyTorch directly, unless we only use the highest-level APIs.

import torch
# Some simple PyTorch functions.
from torch.nn import functional as F
# Transformers has implementations for essentially every well-known LLM. We
# are going to work with what are called causal, auto-regressive, or
# decoder-only models.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2", clean_up_tokenization_spaces=False)
model = AutoModelForCausalLM.from_pretrained("gpt2")

When we use an LLM by API, it seems to take in a string and produce a string. That is not what is going on under the hood. Instead, the LLM takes as input a sequence of tokens: the tokenizer turns an input string into a sequence of tokens, which are the input_ids that appear below. We can ignore the attention_mask for now.

tokenizer("Shakespeare was a")
{'input_ids': [2484, 20946, 373, 257], 'attention_mask': [1, 1, 1, 1]}

Given the encoded string, we can decode each token:

tokenizer.decode(2484), tokenizer.decode(20946), tokenizer.decode(373), tokenizer.decode(257)
('Sh', 'akespeare', ' was', ' a')

Notice that the word Shakespeare is split into two subwords. Also notice that each leading space is part of the token that follows it.

tokenizer("Shakespeare was a ")
{'input_ids': [9776, 373], 'attention_mask': [1, 1]}
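
As a quick check of the leading-space behavior, compare how the tokenizer encodes the same word with and without a leading space (a small extra cell; output omitted):

print(tokenizer("Shakespeare")["input_ids"])
print(tokenizer(" Shakespeare")["input_ids"])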

We write return_tensors="pt" to get a PyTorch tensor. Without this flag, we get the output as a Python list, which the model cannot use.

example_inputs = tokenizer("Shakespeare was a", return_tensors="pt")
example_inputs
{'input_ids': tensor([[ 2484, 20946,   373,   257]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

Inference with GPT-2

The input to the LLM is a sequence of tokens, as we’ve seen. What’s the output? Here is how you “run” the model. The output of the next cell is quite long.

with torch.no_grad():
    example_outputs = model(**example_inputs)

example_outputs has a lot of stuff that we’ll look at later.

example_outputs
CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[ -33.1163,  -32.4949,  -34.4425,  ...,  -41.5759,  -39.8088,
           -33.3394],
         [ -82.0137,  -81.2655,  -86.9767,  ...,  -89.8577,  -87.3788,
           -82.7296],
         [-111.0641, -109.2424, -115.6653,  ..., -117.6229, -114.8920,
          -113.2432],
         [-108.3586, -106.2294, -111.8572,  ..., -114.1808, -113.2533,
          -107.8810]]]), past_key_values=((tensor([[[[-1.1270e+00,  1.5919e+00,  5.9682e-01,  ..., -9.0220e-01,
           -9.1240e-01,  1.7617e+00],
          [-1.4377e+00,  2.2789e+00,  1.2692e+00,  ..., -1.3563e+00,
           -3.7444e-01,  9.8559e-01],
          [-1.8623e+00,  2.1817e+00,  2.0293e+00,  ..., -1.6566e+00,
           -3.0061e+00,  2.5634e+00],
          [-2.3682e+00,  2.7319e+00,  1.6433e+00,  ..., -4.8331e-01,
           -2.0235e+00,  2.3107e+00]],

         [[-2.0059e-01,  5.4482e-01,  3.9853e-01,  ...,  7.0816e-01,
            2.0624e+00,  1.1059e+00],
          [ 3.6042e-01, -2.3326e+00, -1.7750e+00,  ..., -6.5272e-01,
            3.8154e+00,  7.7016e-01],
          [ 8.7625e-01, -5.3654e-01, -7.4250e-03,  ..., -2.2366e+00,
            3.8148e+00,  5.8187e-01],
          [-7.0281e-01, -1.6155e+00, -3.0024e+00,  ..., -1.7344e+00,
            4.5107e+00,  2.2928e-01]],

         [[-4.1901e-01, -1.0563e-01,  5.5384e-01,  ..., -1.5588e+00,
           -1.6996e+00,  1.1117e+00],
          [ 7.6845e-01, -2.5338e+00, -5.2300e-01,  ..., -1.4298e+00,
           -1.7435e+00,  1.0226e+00],
          [ 4.2629e-01, -5.7709e-02,  2.2188e-01,  ..., -2.6890e+00,
            5.9969e-01,  2.3527e+00],
          [ 3.6015e-01,  1.1998e-01,  1.8092e-01,  ..., -3.2287e+00,
            4.0748e-01,  1.3456e+00]],

         ...,

Here is the part that matters:

F.softmax(example_outputs.logits[0,-1], dim=0)

The result is not the next token itself, but a probability distribution over the vocabulary: for each token, the probability that it comes next. This is the most probable next token:

torch.argmax(F.softmax(example_outputs.logits[0,-1], dim=0))
tensor(1049)

Let’s see what word that is:

tokenizer.decode(1049)
' great'
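
We can also look beyond the single most probable token. The extra cell below (output omitted) uses torch.topk to show the five most probable next tokens and their probabilities:

probs = F.softmax(example_outputs.logits[0, -1], dim=0)
top = torch.topk(probs, k=5)
for p, tok_id in zip(top.values, top.indices):
    # Each candidate next token with its probability.
    print(f"{tokenizer.decode(tok_id.item())!r}: {p.item():.3f}")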

Shakespeare was a great what?

example_inputs2 = tokenizer("Shakespeare was a great", return_tensors="pt")
with torch.no_grad():
    example_outputs2 = model(**example_inputs2)
torch.argmax(F.softmax(example_outputs2.logits[0,-1], dim=0))
tensor(21810)
tokenizer.decode(21810)
' poet'
outs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=100
    )
tokenizer.decode(outs[0])
'Shakespeare was a great poet, and he was a great writer. He was a great poet, and he was a great writer. He was a great poet, and he was a great writer. He was a great poet, and he was a great writer. He was a great poet, and he was a great writer. He was a great poet, and he was a great writer. He was a great poet, and he was a great writer. He was a great poet, and he was a great writer'

This is greedy decoding: we pick the most likely next token and use it to extend the prompt. It should be clear that we can write a loop that makes the model generate several tokens of text, which is essentially what generate does; a minimal sketch appears below.
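
Here is that sketch (an extra cell, not the actual implementation inside generate): run the model, take the argmax, append it to the input, and repeat.

ids = example_inputs["input_ids"]
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = torch.argmax(logits[0, -1])               # most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)   # extend the prompt with it
print(tokenizer.decode(ids[0]))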

GPT-2 is quite repetitive here. This is a known failure mode of LLMs, called neural text degeneration. Bigger models are less likely to exhibit it, but it can still occur. One way around this is not to always pick the most likely token, but to sample from the next-token distribution. There are several different ways of doing this. Of course, sampling makes the model nondeterministic, but it leads to a dramatic improvement in output quality.

You can run the cell below several times to get a bunch of different results.

outs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    )
print(tokenizer.decode(outs[0]))
Shakespeare was a popular author in all countries before the 20 th century. He was a great writer of plays, plays of novels, the novels of fiction, stories, fantasy, and stories about ancient Greece, Rome, Europe, and the Middle Ages: his works,
outs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    )
print(tokenizer.decode(outs[0]))
Shakespeare was a very good judge of Shakespearean literary merit," says Paul Haines, a historian at Boston University, who is interested in Shakespeare's ability to create complex narratives.

"It may be true that some of the most popular plays of Shakespeare,
outs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    )
print(tokenizer.decode(outs[0]))
Shakespeare was a very different character. He had no sense of humor. He was a very tough guy. He was a guy who lived his life to the fullest. It was not a normal life, but he was a very tough guy.

On his role
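
The cells above use two sampling knobs: temperature divides the logits before the softmax (lower values make the distribution more peaked), and top_p keeps only the smallest set of most-probable tokens whose cumulative probability reaches p (nucleus sampling) before sampling. Here is a rough sketch of both, applied by hand to the GPT-2 logits from earlier; it is an extra cell and not exactly what generate does internally.

temperature, top_p = 0.8, 0.9
logits = example_outputs.logits[0, -1] / temperature      # rescale the logits
probs = F.softmax(logits, dim=0)
sorted_probs, sorted_ids = torch.sort(probs, descending=True)
keep = torch.cumsum(sorted_probs, dim=0) <= top_p         # the "nucleus" of likely tokens
keep[0] = True                                            # always keep at least the top token
nucleus = sorted_probs[keep] / sorted_probs[keep].sum()   # renormalize over the nucleus
next_id = sorted_ids[keep][torch.multinomial(nucleus, 1)]
print(tokenizer.decode(next_id.item()))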

Modern LLMs: Base Models vs Chat Models

  1. Broadly speaking, there are two kinds of LLMs: base models and instruction-tuned or conversational models.

  2. The base model comes first: it is what is trained on massive amounts of text (primarily from the web). At this scale, the base model develops capabilities in language, code, etc. But it has shortcomings:
    • It can be hard to access some of the capabilities of the base model.
    • Base models are “unaligned”. In contrast, chat models can be “aligned with their creators’ values”. E.g., a chat model can be developed that refuses to answer certain questions or to give inappropriate responses. A base model tends to pick up fairly arbitrary content from the web.
  3. The most capable commercial models (GPT-4, Claude, etc.) are all aligned chat models.
  4. We are going to start working with a base model. This is primarily because it is “more work”: for the tasks that we will start with, it is harder to get a base model to do what we want, so you’ll learn more.
  5. The prompting techniques that we will study are still useful for chat models when you do “real work”. But, they are almost unnecessary for the basic tasks we will do at first.

Zero-shot Prompting

from openai import OpenAI
import os

# Running on a vLLM server.
client = OpenAI(base_url=os.getenv("CS4973_BASE_URL"), api_key=os.getenv("CS4973_API_KEY"))

def llama3(prompt, **kwargs):
    response = client.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B",
        prompt=prompt,
        **kwargs)
    return response.choices[0].text

In zero-shot prompting, we try to directly elicit an answer from a model. We will make some attempts, but they aren’t going to be very successful.

Let’s start with a simple example of zero-shot prompting. In this code, we’ll ask the model about Shakespeare’s occupation without providing any additional context or examples. We’ll use a temperature of 0, which makes decoding greedy and the output deterministic.

output = llama3("""Tell me what job this person had:

    Shakespeare:
       """, max_tokens=30, temperature=0)
print(output)
 - wrote plays
        - wrote sonnets
        - wrote poems
        - wrote plays
        - wrote sonnets
        - wrote poems

Now, let’s try a more focused prompt to get a specific answer about Shakespeare’s occupation.

llama3("Shakespeare worked as a", max_tokens=2, temperature=0)
' playwright,'

Let’s try a different task. We’ll ask the model to recommend things to do in Boston. We first try asking directly for recommendations.

print(llama3("Give me recommendations of things to see in Boston.", max_tokens=20, temperature=0))
 I'm going to be there for a week and I want to see everything.
I'm going to

What went wrong? Instead of giving us a list of things to do, the model simply continued our sentence, completing our own words rather than answering. Now, let’s use the same strategy that we used with occupations.

for _ in range(10):
    print(llama3("In Boston you can do things like", max_tokens=10, temperature=0.8))
 take a qat walk
They have a museum
 drink Starbucks coffee at a table inside the Boston Public
 take a ride on the Seaport Trolley
 say you have the flu to get out of jury
 visit the Freedom Trail, go to the Aquarium,
 this!
Image from Boston Marathon Map project of 
 ride a horse carriage through the streets, or go
 listen to live music in the city's own Music
 running across the finish line of the Boston marathon,
 watch the sunrise from the roof of the Prud

This is a bit better, but it is still not a list. Let’s try to get a list by explicitly prompting for a list. The idea here is that the “:” should help guide the model to produce a list.

print(llama3("List of things to do in Boston:", max_tokens=20, temperature=0.8))
 Go to a sea port, drink fresh brewed beer out of a keg, and get a glimpse

That still didn’t work, so let’s give it a little more help by providing the " -" delimiter for list items.

print(llama3("List of things to do in Boston:\n -", max_tokens=200, temperature=0.8))
 Check out the school
 - Look for the statue of John F. Kennedy
 - See the Boston Commons
 - Go to Newbury Street
 - Walk around the Quincy Market
 - Go to the Museum of Science
 - Check out the Freedom Trail
 - Walk along the Esplanade

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

client_id = 'client_id'
client_secret = 'client_secret'
client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
spotify = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

top_tracks = spotify.current_user_top_tracks(time_range="short_term", limit=20)

tracks = []
for track in top_tracks["items"]:
    tracks.append(track["name"])

tracks_df = pd.DataFrame(tracks, columns=["Tracks"])

tracks_df.to_csv("tracks.csv")
```

```python
import pandas as pd
import spotipy
from spotipy.oauth2 import Spotify
```

It worked! But, we got a bunch of Python code after the list. This can happen. Think about how you might post-process the output to remove the code.
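
One possible post-processing step is sketched below (with a hypothetical helper name): keep the bullet lines and stop at the first line that is not a list item, which drops the trailing code.

def extract_list(completion):
    # The prompt already ended with "\n -", so the first line of the
    # completion is a list item even though it has no leading dash.
    items = []
    for i, line in enumerate(completion.splitlines()):
        line = line.strip()
        if i == 0 or line.startswith("-"):
            items.append(line.lstrip("- ").strip())
        else:
            break  # first non-bullet line: the list is over
    return items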

Here is an alternative approach, which is what I’d come up with earlier.

for _ in range(10):
    print(llama3("When you go to Boston, be sure to visit the", max_tokens=10, temperature=0.8))
 Tasty Burger. The owner started the business as
 Boston Public Library while you are there. I loved
 Museum of Fine Arts. Just behind the Main Entrance
 Old Corner Bookstore.
The Old Corner Bookstore
 Massachusetts Institute of Technology (MIT). This area is
 “Old State House”, located near Faneuil
 Boston Athenaeum, Haynes Communications Building and
 Boston Tea Party Ships and Museum located in the heart
 Museum of Science, and while you are there,
 Commonwealth Museum , which is part of the office of

However, there are limitations to what we can do with zero-shot. Consider the following task. We want a model to read product reviews and determine their sentiment. Here are reviews of a recent video game that received mixed reviews.

REVIEWS = [
    """1)I know Blizzard gets a lot of hate, but personally, I don't think it gets enough. 2)During my childhood Blizzard couldn’t get it wrong. As an adult, they can’t get it right""",
    """Either you work as a laborer or you farm in Diablo 4. They're both the same.""",
    """If you had fun with previous Diablo titles, you'll enjoy this one. It's nothing new or groundbreaking, more or less the same as Diablo III with a graphics upgrade""",
    """I'm not really the target audience here, as I don't stick around with ARPGs for long, but so far it's been quite enjoyable and addicting... and also the one I've played the most.""",
    """I heard a lot of trash talk about D4, and let’s be honest here - a lot of criticism is absolutely justified, but the game is nowhere near as bad as some people paint it to be""",
    """I dont care what everyone says i loved playing the campaign."""
]

Here is a zero-shot attempt to determine the sentiment of the reviews.

llama3(f"Tell me if this is a good or bad review: {REVIEWS[0]}\n\nDecision:", max_tokens=10)
' Not guna bother documenting them here\nDead files'

That answer is nonsense. Let’s give the model a stronger lead-in. The result below is a bit better, since it at least talks about the review, but it is still not a good answer (and it’s wrong).

llama3(f"Tell me if this is a good or bad review: {REVIEWS[0]}\n\This review is", max_tokens=10)
' constructive and informational, judging the game on its own'

Few-Shot Prompting

The idea of few-shot prompting is to provide the model with a few examples of the task we want to perform. This is a bit of an art, but it can work quite well. We first build a prompt containing a few labeled example reviews:

REVIEW_PROMPT = """
Review: tried on gamepass, and freaking love it, might as well get it on steam while its on sale.
Decision: good

Review: Game was released defunct, with Paradox and Colossal lying about the state of the game and the game play aspects.
Decision: bad

Review: Almost seven months after launch this game is still not were it is supposed to.
Decision: bad

Review: It is being improved and with time will become the greatest city builder ever.
Decision: good
"""

We then append a new review and let the model fill in the decision:

llama3(REVIEW_PROMPT + "Review: " + REVIEWS[1] + "\nDecision:", max_tokens=3, stop=["\n"])
' bad'

Let’s try it on all the reviews.

for r in REVIEWS:
    d = llama3(REVIEW_PROMPT + f"\nReview:{r}\nDecision:", max_tokens=3, stop=["\n"])
    print(f"Review: {r}\nDecision: {d}")
Review: 1)I know Blizzard gets a lot of hate, but personally, I don't think it gets enough. 2)During my childhood Blizzard couldn’t get it wrong. As an adult, they can’t get it right
Decision:  decent
Review: Either you work as a laborer or you farm in Diablo 4. They're both the same.
Decision:  bad
Review: If you had fun with previous Diablo titles, you'll enjoy this one. It's nothing new or groundbreaking, more or less the same as Diablo III with a graphics upgrade
Decision:  good
Review: I'm not really the target audience here, as I don't stick around with ARPGs for long, but so far it's been quite enjoyable and addicting... and also the one I've played the most.
Decision:  great
Review: I heard a lot of trash talk about D4, and let’s be honest here - a lot of criticism is absolutely justified, but the game is nowhere near as bad as some people paint it to be
Decision: good
Review: I dont care what everyone says i loved playing the campaign.
Decision:  good

Limitations of Few-Shot Prompting

It may seem that few-shot is always better than zero-shot, but this isn’t the case. Look at the few-shot prompt below. It’s straightforward, but the model gets the decision wrong on what seems like a trivial example. Why?

REVIEW_PROMPT = """
Review: Diablo is an awful game.
Decision: bad

Review: Stardew Valley is amazing
Decision:good

Review: SimCity is the best game of all time.
Decision: good
"""

llama3(REVIEW_PROMPT + f"\nReview: Diablo is the best game ever.\nDecision:", max_tokens=3, stop=["\n"], temperature=0)
' bad'

Intuitively, there are two rules the model could have induced from the prompt:

  1. good/bad is a label for the sentiment of the review
  2. good/bad is a label for the game, independent of the review; so Diablo is labelled bad

Think about how to address this; one possible fix is sketched below.
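
One hypothetical fix (a sketch: the example reviews are invented and no output is shown): break the spurious “game name implies label” rule by including examples where the same game receives both labels, so the only rule consistent with every example is the sentiment of the review itself.

# Hypothetical few-shot prompt: each game appears with both labels.
BALANCED_REVIEW_PROMPT = """
Review: Diablo is an awful game.
Decision: bad

Review: Diablo is a masterpiece and I can't stop playing it.
Decision: good

Review: Stardew Valley is amazing
Decision: good

Review: Stardew Valley bored me to tears.
Decision: bad
"""

llama3(BALANCED_REVIEW_PROMPT + "\nReview: Diablo is the best game ever.\nDecision:", max_tokens=3, stop=["\n"], temperature=0)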

Natural Language to Code: Zero Shot

This is the most obvious zero-shot way to try to get the model to generate code. But, it doesn’t work very well. The model again “completes our words”.

llama3("Write a Python function to compute the maximum of three numbers.")
' Use conditional statements (if-elif-else).\n\n# Solution:\n\n# python program'

Instead, we can reformat the prompt to look like the prefix of a Python function:

out = llama3('''
def max(a,b,c):
       """
       Computes the max of a,b,c
       """'''.strip(), max_tokens=100, stop=["\n#", "\ndef", "\nclass"])
print(out)
       # check if a = b = c
       if a == b and b == c:
           print(max(a,b,c))
           return a

       #if a is greater than either b or c
       elif a > b:
           if a > c:
               print(max(a,b,c))
               return a
           #if a is greater than a but smaller than c
           else:
               print(max(a,c,b))
               return c
             

       #if a

Note that we specified stop strings. This is because the model is free to keep generating text, and we need to tell it when to stop. In a curly-brace language such as JavaScript, a top-level function typically ends with "\n}". It is a little harder with Python, since Python relies on indentation. Instead, we have to think about how the code after a function may begin, which is what we do above. The list of stop strings is not complete. This is definitely a hack, but it is what we do in practice.
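
To avoid repeating the prompt format and stop strings, we could wrap the pattern in a small helper. The function below is hypothetical (the name and the extra "\nif __name__" stop string are arbitrary choices), but it follows the same recipe as the cell above.

def complete_python_function(signature, docstring):
    # Build a prompt that looks like the start of a Python function definition.
    prompt = f'{signature}\n    """\n    {docstring}\n    """\n'
    body = llama3(prompt, max_tokens=200, stop=["\n#", "\ndef", "\nclass", "\nif __name__"])
    return prompt + body

print(complete_python_function("def max3(a, b, c):", "Computes the max of a, b, c"))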