# /// script
# requires-python = "== 3.12"
# dependencies = [
#     "torch == 2.7",
#     "transformers == 4.55",
#     "ipywidgets",
# ]
# ///

Tokenization and Basic Prompting

Tokenization

# We need to use PyTorch directly, unless we only use the highest-level APIs.
import torch
# Some simple PyTorch functions.
from torch.nn import functional as F
# Transformers has implementations for essentially every well-known LLM. We
# are going to work with what are called causal, auto-regressive, or
# decoder-only models.
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base", clean_up_tokenization_spaces=False)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B-Base").to("cpu")

When we use an LLM through an API, it seems to take in a string and produce a string. That is not what happens under the hood. Instead, the LLM takes as input a sequence of tokens: the tokenizer turns the input string into a sequence of token IDs, which are the input_ids that appear below. We can ignore the attention_mask for now. We pass return_tensors="pt" to get a PyTorch tensor; without this flag, we get the output as a Python list, which the model cannot use.

example_inputs = tokenizer("Shakespeare was a", return_tensors="pt")
example_inputs
{'input_ids': tensor([[ 2016, 36871,   572,   264]]), 'attention_mask': tensor([[1, 1, 1, 1]])}

Given the encoded string, we can decode each token:

[tokenizer.decode(2016), tokenizer.decode(36871), tokenizer.decode(572), tokenizer.decode(264)]
['Sh', 'akespeare', ' was', ' a']

Notice that the word Shakespeare is split into two subwords. Also notice that each leading space is part of the token that follows it.

Here is another example:

example_inputs = tokenizer("hello hello", return_tensors="pt")
example_inputs
{'input_ids': tensor([[14990, 23811]]), 'attention_mask': tensor([[1, 1]])}

Why do we have two different token IDs for the two words hello?

  1. Play around with more examples, e.g., common vs. uncommon names.
  2. Explain that the tokenizer is learned.
  3. Show that we can decode the whole sequence of course.
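To build intuition for why Shakespeare splits into subwords, here is a toy greedy longest-match tokenizer over a tiny, hand-picked vocabulary. This is only a sketch: real tokenizers like Qwen's learn their vocabulary from data (e.g., with byte-pair encoding), and the vocabulary below is made up for illustration.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hand-picked
# vocabulary. Real tokenizers learn their vocabulary from a corpus.
VOCAB = ["Sh", "akespeare", " was", " a", " w", "as", "a", "S", "h"]

def toy_tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("Shakespeare was a"))
# ['Sh', 'akespeare', ' was', ' a'] -- the same split as the real tokenizer
```

Because "Shakespeare" is not in the toy vocabulary as a whole word, it comes out as the two pieces the vocabulary does contain, just as the real tokenizer split it above.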

Inference

The input to the LLM is a sequence of tokens as we’ve seen. What’s the output? Here is how you “run” the model. The next cell is quite long.

example_inputs = tokenizer("Shakespeare was a", return_tensors="pt").to(model.device)
with torch.no_grad():
    example_outputs = model(
        **example_inputs,
        # use_cache=False suppresses a lot of output that we don't care about. 
        use_cache=False) 
example_outputs
CausalLMOutputWithPast(loss=None, logits=tensor([[[ 5.4958,  3.8857,  7.0136,  ..., -4.0584, -4.0584, -4.0584],
         [11.6809, 11.8749,  8.9903,  ..., -0.8329, -0.8329, -0.8329],
         [12.5860, 13.0843,  9.2025,  ...,  1.0495,  1.0495,  1.0495],
         [ 8.7087, 12.7536,  6.7422,  ...,  0.1486,  0.1486,  0.1486]]]), past_key_values=None, hidden_states=None, attentions=None)

example_outputs has a lot of stuff that we’ll look at later. Here is the part that matters.

example_outputs.logits[0, -1]
tensor([ 8.7087, 12.7536,  6.7422,  ...,  0.1486,  0.1486,  0.1486])

The logits are scores (high score good, low score bad). They are not probabilities (why not?).

We can turn any set of scores $u_{1:n}$ into a probability distribution as follows:

\[Pr\{X=t_i\} = \frac{b^{u_i}}{\Sigma_{j=1}^{n} b^{u_j}}\]

for any $b > 1$. Why this approach? Why not just do this?

\[Pr\{X=t_i\} = \frac{u_i}{\Sigma_{j=1}^{n} u_j}\]

We will use $b=e$, which gives us the softmax function.
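As a sanity check, here is a minimal softmax in plain Python (the logits are made up). It also shows why the naive normalization $u_i / \Sigma_j u_j$ fails: logits can be negative, so the naive "probabilities" can be negative, and the denominator can even be zero.

```python
import math

def softmax(scores):
    # Exponentiate so every term is positive, then normalize to sum to 1.
    exps = [math.exp(u) for u in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -3.0]  # logits can be negative, like the model's
probs = softmax(logits)
print(probs)       # every entry is in (0, 1)
print(sum(probs))  # 1.0 (up to floating point)

# The naive approach fails here: 2.0 + 1.0 - 3.0 = 0, so computing
# u_i / sum(u) would divide by zero, and for other inputs it can
# produce negative "probabilities".
```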

prediction = F.softmax(example_outputs.logits[0,-1], dim=0)
prediction
tensor([7.2593e-07, 4.1453e-05, 1.0159e-07,  ..., 1.3909e-10, 1.3909e-10,
        1.3909e-10])

What appears above is not the next token, but the probability that any token is the next token. How many tokens are there?

len(prediction)
151936

Let $\mathcal{T} = \{t_1, \cdots, t_n\}$ be the set of tokens. Given a prompt $y$, the model produces a probability distribution over the next token $X$, which is $Pr\{X=t_i \mid y\}$ for all $t_i \in \mathcal{T}$.

To generate text, we sample from this distribution.

Actually, we don’t need to sample. The simplest approach is to pick the most likely next token.

torch.argmax(prediction)
tensor(2244)

Let’s see what word that is:

tokenizer.decode(2244)
' great'

Shakespeare was a great what?

example_inputs = tokenizer("Shakespeare was a great", return_tensors="pt").to(model.device)
with torch.no_grad():
    example_outputs = model(**example_inputs)
prediction = F.softmax(example_outputs.logits[0,-1], dim=0)
torch.argmax(prediction)
tensor(6916)
tokenizer.decode(6916)
' writer'

This is greedy decoding. We pick the most likely next token and use it to extend the prompt. It should be clear that we can write a loop that makes the model generate several tokens of text, which is what the generate method does below.
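The greedy loop itself is easy to sketch. To keep the example self-contained, a made-up bigram table stands in for the real model; with the actual model, the next-token scores would come from example_outputs.logits[0, -1] instead.

```python
def greedy_generate(next_logits, tokens, max_new_tokens):
    # Repeatedly pick the highest-scoring next token and append it.
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)  # scores over the vocabulary
        best = max(range(len(logits)), key=lambda i: logits[i])
        tokens.append(best)
    return tokens

# Toy "model": a bigram table with made-up scores. The real model returns
# logits over all 151,936 tokens given the whole prefix, not just the
# last token.
BIGRAM = {0: [0.0, 2.0, 1.0], 1: [0.0, 0.0, 3.0], 2: [5.0, 0.0, 0.0]}

print(greedy_generate(lambda ts: BIGRAM[ts[-1]], [0], 4))
# [0, 1, 2, 0, 1]
```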

example_inputs = tokenizer("Shakespeare was a great", return_tensors="pt").to(model.device)
example_outputs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=50)
example_outputs
tensor([[ 2016, 36871,   572,   264,  2244,  6916,    11,   714,   566,   572,
          1083,   264,  2244, 12089,    13,  1260,   572,   264,  2244, 12089,
          1576,   566,   572,  2952,   311,   990,   806, 15358,  7361,   311,
         18379,   806,  4378,    13,  1260,   572,  2952,   311,   990,   806,
         15358,  7361,   311, 18379,   806,  4378,   553,  1667,   806, 15358,
          7361,   311, 18379,   806,  4378]])
tokenizer.decode(example_outputs[0])
'Shakespeare was a great writer, but he was also a great actor. He was a great actor because he was able to use his acting skills to enhance his writing. He was able to use his acting skills to enhance his writing by using his acting skills to enhance his writing'

The model seems to be quite repetitive. This is a known failure mode of LLMs. Bigger models are less likely to exhibit neural text degeneration, but it can still occur. One way around this is to not just pick the most likely token, but to sample from the distribution of tokens. There are several different ways of doing this. Of course, it makes the model nondeterministic, but leads to a dramatic improvement in output quality.

You can run the cell below several times to get a bunch of different results. The cell is using nucleus sampling.

Recall that the model produces a distribution over the next token:

$Pr\{X=t_i \mid y\}$ for all $t_i \in \mathcal{T}$.

In nucleus sampling, or top-$p$ sampling, we only sample from the most likely tokens whose cumulative probability is at least $p$.

In other words, we:

  1. Compute the smallest subset of tokens $\mathcal{T}' \subseteq \mathcal{T}$ such that $\Sigma_{t_i \in \mathcal{T}'}Pr\{X=t_i \mid y\}\ge p$.
  2. Sample from the renormalized distribution $Pr\{X=t_i \mid y\} / \Sigma_{t_j \in \mathcal{T}'}Pr\{X=t_j \mid y\}$ for all $t_i \in \mathcal{T}'$.

Note that we recompute the set $\mathcal{T}'$ for each generated token.
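Here is a sketch of those two steps in plain Python, operating on a list of probabilities rather than a tensor (in practice, generate performs this filtering internally):

```python
import random

def top_p_filter(probs, p):
    # Step 1: keep the most likely tokens until their cumulative
    # probability reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Step 2: renormalize over the kept tokens.
    return {i: probs[i] / cumulative for i in kept}

def sample_top_p(probs, p):
    filtered = top_p_filter(probs, p)
    ids = list(filtered)
    return random.choices(ids, weights=[filtered[i] for i in ids])[0]

probs = [0.5, 0.25, 0.125, 0.125]  # made-up next-token distribution
print(top_p_filter(probs, 0.7))
# {0: 0.666..., 1: 0.333...}: tokens 2 and 3 are dropped
```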

example_inputs = tokenizer("Shakespeare was a great", return_tensors="pt").to(model.device)
example_outputs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_p=0.9,
    max_new_tokens=50)
tokenizer.decode(example_outputs[0])
"Shakespeare was a great writer, but I don't know if it's fair to call him a great performer either.  In my opinion, he was better at having fun as a performer than as a writer.  I think his stagecraft was very clever, but it"

Another common strategy is temperature-based sampling (which can be combined with nucleus sampling).

In temperature-based sampling, we take the logits $u_{1:n}$ and compute $Pr\{X=t_i\}$ as follows:

\[Pr\{X=t_i\} = \frac{e^{u_i/t}}{\Sigma_{j=1}^{n} e^{u_j/t}}\]

When $t=1$, it is identical to softmax. When $t<1$, we “boost the scores” of more likely tokens.

Informally, when $t$ is lowered, we make unlikely tokens even more unlikely, so the model gives a “more confident” but “less creative” response. Of course, “increasing creativity” also increases the likelihood that the model produces nonsense. We typically use values between $0.1$ and $0.8$.
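A quick illustration with made-up logits: the same scores yield a sharper distribution at low temperature and a flatter one at high temperature.

```python
import math

def softmax_with_temperature(logits, t):
    # Divide every logit by the temperature before exponentiating.
    exps = [math.exp(u / t) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for a three-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Lower t concentrates mass on the top token; higher t spreads it out.
```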

example_inputs = tokenizer("Shakespeare was a great", return_tensors="pt").to(model.device)
example_outputs = model.generate(
    **example_inputs,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=500.0,
    max_new_tokens=50)
tokenizer.decode(example_outputs[0])

'Shakespeare was a great dramalist whom English-speaking Europeans regarded largely positively; their attitudes and understanding seemed (in his  times I might be brief too) similar on what were not at, now, being an agreed opinion, especially English attitudes which still differ: "English view'

Modern LLMs: Base Models vs Chat Models

  1. Broadly speaking, there are two kinds of LLMs: base models and instruction-tuned or conversational models.

  2. The base model comes first: it is what is trained on massive amounts of text (primarily from the web). At this scale, the base model develops capabilities in language, code, etc. But it has shortcomings:
    • It can be hard to access some of the capabilities of the base model.
    • Base models are “unaligned”. In contrast, chat models can be “aligned with their creators’ values”. E.g., a chat model can be trained to refuse to answer certain questions or to avoid giving inappropriate responses. A base model tends to pick up fairly arbitrary content from the web.
  3. The most capable commercial models (GPT-4, Claude, etc.) are all aligned chat models.
  4. We are going to start working with a base model. This is primarily because it is “more work”. For the tasks that we will start with, it is harder to get them to do what we want, so you’ll learn more.
  5. The prompting techniques that we will study are still useful for chat models when you do “real work”. But, they are almost unnecessary for the basic tasks we will do at first.

Zero-shot Prompting

In zero-shot prompting, we try to directly elicit an answer from a model. We will make some attempts, but they aren’t going to be very successful.

def generate_text(prompt, temperature=0, max_new_tokens=20):
    example_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generate_kwargs = {
        "pad_token_id": tokenizer.eos_token_id,
        "max_new_tokens": max_new_tokens,
    }
    if temperature > 0:
        generate_kwargs["do_sample"] = True
        generate_kwargs["temperature"] = temperature
    else:
        generate_kwargs["do_sample"] = False
    with torch.no_grad():
        example_outputs = model.generate(**example_inputs, **generate_kwargs)
    return tokenizer.decode(example_outputs[0, example_inputs["input_ids"].shape[1]:])

generate_text("""Tell me what job this famous person had.

Shakespeare:""", temperature=0)
' I am a famous person. I am a famous person. I am a famous person. I am'
generate_text("""Tell me why this person is famous

Shakespeare:""", temperature=0)
' William Shakespeare (1564-1616) was an English poet, playwright,'

Why doesn’t this work? The intuition is that base models are trained primarily on “high quality” text from the web, and the web has a paucity of text in strict Q&A format. What it has are reference sources: Wikipedia, newspapers, etc. So, let’s rephrase our prompt to look more like a factual statement.

generate_text("""Shakespeare worked as a""", temperature=0, max_new_tokens=3)
' playwright, actor'

We can try several variations of the prompt, and try different famous names. We can also try other kinds of queries, e.g., getting travel recommendations.

for _ in range(10):
    print(generate_text("""When you go to Boston, be sure to visit the""", temperature=0.8))

 University of Massachusetts Boston. Visit the Museum of Science. There are many different exhibits and hands-on activities
 Boston Museum of Fine Arts, a major art institution located in the heart of historic Boston. Admission is
 famous Market Street Market. At one time, the Market was a huge marketplace. Now, it is
 4th of July celebration, the annual Boston Tea Party protest. What is the name of the
 Basilica of Saints Peter and Paul. This magnificent Roman Catholic cathedral features stunning stained glass windows that you
 Museum of Science and Industry. The museum contains interactive exhibits and a large building containing science exhibits and activities
 Massachusetts State House, also known as the State House of Representatives in Boston. The main building is made
 World Trade Center and Pentagon. Not only are they iconic buildings, but they're also home to the
 famous 18-hole golf course at the University of Massachusetts Amherst. It takes about 
 historic center, or Commonwealth Voucher. This is a free 10 percent voucher for every credit

However, there are limitations to what we can do with zero-shot prompting. Consider the following task: we want a model to read product reviews and determine their sentiment. Here are reviews of a recent video game that had a mixed reception.

REVIEWS = [
    """1)I know Blizzard gets a lot of hate, but personally, I don't think it gets enough. 2)During my childhood Blizzard couldn’t get it wrong. As an adult, they can’t get it right""",
    """Either you work as a laborer or you farm in Diablo 4. They're both the same.""",
    """If you had fun with previous Diablo titles, you'll enjoy this one. It's nothing new or groundbreaking, more or less the same as Diablo III with a graphics upgrade""",
    """I'm not really the target audience here, as I don't stick around with ARPGs for long, but so far it's been quite enjoyable and addicting... and also the one I've played the most.""",
    """I heard a lot of trash talk about D4, and let’s be honest here - a lot of criticism is absolutely justified, but the game is nowhere near as bad as some people paint it to be""",
    """I dont care what everyone says i loved playing the campaign."""
]

Let’s try to construct a prompt. (Nothing really works.)

generate_text(f"Tell me if this is a good or a bad review for the game:\n\n{REVIEWS[0]}\n\nDecision:", temperature=0)
' I think this is a good review for the game. The reviewer acknowledges that Blizzard has a reputation for'
generate_text(f"Tell me if this is a good or a bad review for the game:\n\n{REVIEWS[-1]}\n\nDecision:", temperature=0)
' I would recommend this game to people who want to play a campaign game. I would not recommend it'
generate_text(f"Tell me if this is a good or a bad review for the game:\n\n{REVIEWS[-1]}\n\nDecision (good/bad):", temperature=0)
' Good\nI really enjoyed the campaign. I was able to get to know the characters and the world'

Few-Shot Prompting

The main idea is to give the model a few examples of inputs and outputs.

REVIEW_PROMPT = """
Review: tried on gamepass, and freaking love it, might as well get it on steam while its on sale.
Decision: good

Review: Game was released defunct, with Paradox and Colossal lying about the state of the game and the game play aspects.
Decision: bad

Review: It is not being improved despite promises by Paradox.
Decision: bad

Review: Almost seven months after launch this game is still not were it is supposed to.
Decision: bad
"""

for r in REVIEWS:
    d = generate_text(REVIEW_PROMPT + f"\nReview:{r}\nDecision:", temperature=0, max_new_tokens=3)
    print(f"Review: {r}\nDecision: {d}")
Review: 1)I know Blizzard gets a lot of hate, but personally, I don't think it gets enough. 2)During my childhood Blizzard couldn’t get it wrong. As an adult, they can’t get it right
Decision:  bad

Review
Review: Either you work as a laborer or you farm in Diablo 4. They're both the same.
Decision:  bad

Review
Review: If you had fun with previous Diablo titles, you'll enjoy this one. It's nothing new or groundbreaking, more or less the same as Diablo III with a graphics upgrade
Decision:  good

Review
Review: I'm not really the target audience here, as I don't stick around with ARPGs for long, but so far it's been quite enjoyable and addicting... and also the one I've played the most.
Decision:  good

Review
Review: I heard a lot of trash talk about D4, and let’s be honest here - a lot of criticism is absolutely justified, but the game is nowhere near as bad as some people paint it to be
Decision:  good

Review
Review: I dont care what everyone says i loved playing the campaign.
Decision:  good

Review

The LLM picks up the patterns in the data: (1) the answer should be one of good/bad; (2) the task is to judge the sentiment of the review. We don’t even have to write good/bad. We can write X/Y as below. (But don’t do this. It is less reliable.)

REVIEW_PROMPT = """
Review: tried on gamepass, and freaking love it, might as well get it on steam while its on sale.
Decision: X

Review: Game was released defunct, with Paradox and Colossal lying about the state of the game and the game play aspects.
Decision: Y

Review: It is not being improved despite promises by Paradox.
Decision: Y

Review: Almost seven months after launch this game is still not were it is supposed to.
Decision: Y"""

for r in REVIEWS:
    d = generate_text(REVIEW_PROMPT + f"\nReview:{r}\nDecision:", temperature=0, max_new_tokens=1)
    print(f"Review: {r}\nDecision: {d}")
Review: 1)I know Blizzard gets a lot of hate, but personally, I don't think it gets enough. 2)During my childhood Blizzard couldn’t get it wrong. As an adult, they can’t get it right
Decision:  Y
Review: Either you work as a laborer or you farm in Diablo 4. They're both the same.
Decision:  Y
Review: If you had fun with previous Diablo titles, you'll enjoy this one. It's nothing new or groundbreaking, more or less the same as Diablo III with a graphics upgrade
Decision:  Y
Review: I'm not really the target audience here, as I don't stick around with ARPGs for long, but so far it's been quite enjoyable and addicting... and also the one I've played the most.
Decision:  Y
Review: I heard a lot of trash talk about D4, and let’s be honest here - a lot of criticism is absolutely justified, but the game is nowhere near as bad as some people paint it to be
Decision:  Y
Review: I dont care what everyone says i loved playing the campaign.
Decision:  X

It may seem that few-shot is always better than zero-shot, but this isn’t the case. Look at the X/Y few-shot prompt above. It’s straightforward, but the model gets the decision wrong on reviews that seem trivial. Why?

Natural Language to Code: Zero Shot

A model trained on source code can solve the natural-language-to-code task remarkably well, because code tends to be interleaved with documentation.

Key idea: make the prompt “look like code”. A Python function definition starts with def, so let’s just write that.

result = generate_text("""def factorial(""", temperature=0, max_new_tokens=150)
print(result)
 n ):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

def main():
    n = int(input("Enter a number: "))
    print("Factorial of", n, "is", factorial(n))

if __name__ == "__main__":
    main()<|endoftext|>

Stopping

Notice that this time, the model kept going and generated a main function. It doesn’t know when to stop; it is just freely generating text. We need to think about how to stop generation. The typical approach is to specify stop sequences: character sequences that typically denote the end of the desired output.

If we were in a curly-brace language, such as JavaScript, a top-level function typically ends with "\n}". It is a little harder with Python, since Python relies on indentation. Instead, we have to think about how the code after a function may begin, which is what we do below: the blank line that separates a top-level function from whatever follows it serves as the stop sequence. This is definitely a hack, but it is what we do in practice.
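One way to implement this is a small helper (hypothetical, not part of generate_text above) that truncates the generated text at the earliest occurrence of any stop sequence. The stop list below is illustrative and certainly incomplete.

```python
def truncate_at_stop(text, stop_sequences):
    # Cut the text at the earliest occurrence of any stop sequence.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Plausible starts of top-level code after a Python function: a new
# function, a class, a main guard, or a column-0 comment. Not complete.
PY_STOPS = ["\ndef ", "\nclass ", "\nif __name__", "\n#"]

completion = "    return n * factorial(n-1)\n\ndef main():\n    pass"
print(truncate_at_stop(completion, PY_STOPS))
# keeps only the factorial body, dropping the generated main function
```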

result = generate_text("""
def factorial(""",
    temperature=0, max_new_tokens=200)
result = result.split("\n\n")[0]
print(result)

 n ):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)