Introduction to Language Models

Imports

Here, we import the necessary modules and functions from Hugging Face transformers as well as torch and torch.nn.functional.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import torch.nn.functional as F

Loading the Model and Tokenizer

We specify the model name, load the tokenizer, and then load the model.

MODEL = "HuggingFaceTB/SmolLM2-360M"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

Examining Tokenization Behavior

Here, we provide some input text to the tokenizer and examine the resulting token IDs. Ignore the attention mask for now – we will get to it in a later class.

example_inputs = tokenizer("Shakespeare was a")
example_inputs
{'input_ids': [4370, 10735, 436, 253], 'attention_mask': [1, 1, 1, 1]}
tokenizer.decode(4370), tokenizer.decode(10735), tokenizer.decode(436), tokenizer.decode(253)
('Sh', 'akespeare', ' was', ' a')
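Decoding the full list of IDs stitches the subword pieces back together; this round trip should reproduce the original string.

# Round-trip: decode the whole list of token IDs back into text
tokenizer.decode(example_inputs["input_ids"])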

In the next code blocks, we explore how the tokenizer handles alphanumeric strings and spacing.

tokenizer("i7245tf7238tvo")
{'input_ids': [89, 39, 34, 36, 37, 10543, 39, 34, 35, 40, 100, 16104], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[ tokenizer.decode(t) for t in [89, 39, 34, 36, 37, 10543, 39, 34, 35, 40, 100, 16104] ]
['i', '7', '2', '4', '5', 'tf', '7', '2', '3', '8', 't', 'vo']

We see that the tokenizer breaks this string into individual digits and short letter chunks rather than whole words.


Tokenization of repeated words and spaces

We check how multiple spaces or slight spacing differences affect the tokenization.

tokenizer("hello hello")
{'input_ids': [28120, 33662], 'attention_mask': [1, 1]}
tokenizer("hello     hello")
{'input_ids': [28120, 289, 33662], 'attention_mask': [1, 1, 1]}
tokenizer("hello    hello")
{'input_ids': [28120, 333, 33662], 'attention_mask': [1, 1, 1]}
tokenizer("hello       hello")
{'input_ids': [28120, 5013, 33662], 'attention_mask': [1, 1, 1]}

Notice how the middle token, which absorbs the run of extra spaces, changes as the number of spaces changes.
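To see what those middle tokens are, we can decode them directly (the IDs below come from the outputs above); each should turn out to be a run of spaces of a different length.

# Decode the middle tokens from the three multi-space examples above
[ tokenizer.decode(t) for t in [289, 333, 5013] ]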


More examples with punctuation and spacing

Next, we look at punctuation, and at how a leading space changes the way the first word is tokenized.

tokenizer("Shakespeare was poet! Really.")
{'input_ids': [4370, 10735, 436, 5720, 17, 27434, 30], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
[ tokenizer.decode(t) for t in [4370, 10735, 436, 5720, 17, 27434, 30] ]
['Sh', 'akespeare', ' was', ' poet', '!', ' Really', '.']
tokenizer(" Shakespeare was poet! Really.")
{'input_ids': [12081, 436, 5720, 17, 27434, 30], 'attention_mask': [1, 1, 1, 1, 1, 1]}
[ tokenizer.decode(t) for t in [12081, 436, 5720, 17, 27434, 30] ]
[' Shakespeare', ' was', ' poet', '!', ' Really', '.']

Simple Model Inference

We now feed inputs into the model to see the raw logits that come back.

inputs = tokenizer("Shakespeare was a", return_tensors="pt")
inputs
{'input_ids': tensor([[ 4370, 10735,   436,   253]]), 'attention_mask': tensor([[1, 1, 1, 1]])}
with torch.no_grad():
    outputs = model(inputs["input_ids"])
outputs
CausalLMOutputWithPast(loss=None, logits=tensor([[[ 5.6560, -1.0794, -0.9437,  ...,  4.9135,  2.1713,  4.0637],
             [12.7081, -3.2801, -3.2621,  ...,  5.3883,  7.5067,  1.3504],
             [12.6834,  0.8826,  0.9367,  ...,  9.2431,  7.9944,  3.9167],
             [ 9.1099,  1.0419,  1.3317,  ...,  7.8556,  6.6583,  3.5514]]]), past_key_values=DynamicCache(), hidden_states=None, attentions=None)

Softmax and argmax on the logits

We apply a softmax to the last token’s logits and then select the highest-probability token.

predictions = F.softmax(outputs.logits[0, -1], dim=0)
predictions.shape
torch.Size([49152])
torch.argmax(predictions)
tensor(1109)
tokenizer.decode(1109)
' great'

The model predicts ' great' as the next token for the prompt "Shakespeare was a".
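argmax only shows the single most likely token. To peek at the runners-up, we can take the top few entries of the distribution with torch.topk (a quick sketch; the exact tokens and probabilities depend on the model):

# Show the five highest-probability next tokens and their probabilities
top = torch.topk(predictions, 5)
for prob, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode(int(token_id))), round(prob.item(), 3))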


Continuing the sequence

We manually extend the prompt with the token ' great' and see what the next token is.

inputs = tokenizer("Shakespeare was a great", return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs["input_ids"])
predictions = F.softmax(outputs.logits[0, -1], dim=0)
torch.argmax(predictions)
tensor(6535)
tokenizer.decode(6535)
' writer'

Shapes of inputs and outputs

We look at the shapes of input_ids and the resulting logits from the model.

inputs["input_ids"].shape
torch.Size([1, 5])
outputs.logits.shape
torch.Size([1, 5, 49152])
outputs.logits
tensor([[[ 5.6560, -1.0794, -0.9437,  ...,  4.9135,  2.1713,  4.0637],
         [12.7081, -3.2801, -3.2621,  ...,  5.3883,  7.5067,  1.3504],
         [12.6834,  0.8826,  0.9367,  ...,  9.2431,  7.9944,  3.9167],
         [ 9.1099,  1.0419,  1.3317,  ...,  7.8556,  6.6583,  3.5514],
         [ 9.1243,  0.8816,  1.0765,  ..., 11.0147,  5.1538,  0.2806]]])
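Because there is one row of logits per input position, we can read off the model's greedy next-token prediction after every prefix of the prompt, not just the last one. A small sketch:

# Greedy prediction after each prefix of "Shakespeare was a great"
for position in range(outputs.logits.shape[1]):
    next_id = torch.argmax(outputs.logits[0, position])
    print(position, repr(tokenizer.decode(int(next_id))))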

Using model.generate for Text Generation

Below, we use the built-in generate method to produce sequences from a prompt. In each call we specify a maximum number of new tokens and a pad_token_id (to avoid warnings), and we decode the resulting tokens back into a string.

Greedy Generation

outputs = model.generate(
    tokenizer("Shakespeare was a", return_tensors="pt")["input_ids"],
    pad_token_id = tokenizer.eos_token_id,
    max_new_tokens=50)
tokenizer.decode(outputs[0])
'Shakespeare was a great writer, but he was also a great actor. He was a great actor because he was a great writer. He was a great writer because he was a great actor. He was a great actor because he was a great writer. He was a'
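Greedy decoding is essentially the argmax step from earlier applied in a loop, feeding each predicted token back into the model. A minimal sketch of that idea (not the actual implementation of generate, which also handles caching and stopping conditions):

# Naive greedy decoding loop: repeatedly append the argmax token
ids = tokenizer("Shakespeare was a", return_tensors="pt")["input_ids"]
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = torch.argmax(logits[0, -1]).reshape(1, 1)
    ids = torch.cat([ids, next_id], dim=1)
tokenizer.decode(ids[0])

For the same prompt, this should match the beginning of the greedy generate output above.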

Sampling (without temperature/top-p)

When we enable do_sample=True without specifying a temperature or top-p, the model samples each next token at random from its full predicted probability distribution.

outputs = model.generate(
    tokenizer("Shakespeare was a", return_tensors="pt")["input_ids"],
    pad_token_id = tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    temperature=None,
    top_p=None)
tokenizer.decode(outputs[0])
'Shakespeare was a man, the son of a London theatrical company manager. Yet Shakespeare is the only Shakespeare.\n\nThere is no record of Shakespeare’s birth name. It was given by his mother. However, when his father died, it was published. No'
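Sampling simply means drawing the next token at random, weighted by the softmax probabilities, instead of taking the argmax. Using the predictions tensor computed earlier for "Shakespeare was a great", one sampling step looks roughly like this:

# Draw one next token at random, weighted by the model's probabilities
next_id = torch.multinomial(predictions, num_samples=1)
tokenizer.decode(int(next_id))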

Sampling with top_p

Here, we constrain the sampling with top_p=0.9 (nucleus sampling), which restricts each sampling step to the smallest set of tokens whose cumulative probability reaches 0.9.

outputs = model.generate(
    tokenizer("Shakespeare was a", return_tensors="pt")["input_ids"],
    pad_token_id = tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    temperature=None,
    top_p=0.9)
tokenizer.decode(outputs[0])
'Shakespeare was a man of his time. It is his view of the state of society which matters here, and he has nothing to do with what happens to the people he portrays. Shakespeare does not intend to offer an explanation or justification for the state of the world he'
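Conceptually, nucleus sampling sorts the probabilities, keeps the smallest set of tokens whose cumulative probability reaches top_p, renormalizes, and samples from that reduced set. A rough sketch of the idea (transformers handles the exact cutoff slightly differently):

# Keep only the most likely tokens covering ~90% of the probability mass
sorted_probs, sorted_ids = torch.sort(predictions, descending=True)
keep = torch.cumsum(sorted_probs, dim=0) <= 0.9
keep[0] = True  # always keep at least the single most likely token
nucleus_probs = sorted_probs[keep] / sorted_probs[keep].sum()
next_id = sorted_ids[keep][torch.multinomial(nucleus_probs, num_samples=1)]
tokenizer.decode(int(next_id))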

Sampling with low temperature

A lower temperature (temperature=0.2) sharpens the distribution, so sampling becomes nearly deterministic; here it reproduces the greedy output.

outputs = model.generate(
    tokenizer("Shakespeare was a", return_tensors="pt")["input_ids"],
    pad_token_id = tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.2,
    top_p=None)
tokenizer.decode(outputs[0])
'Shakespeare was a great writer, but he was also a great actor. He was a great actor because he was a great writer. He was a great writer because he was a great actor. He was a great actor because he was a great writer. He was a'

Sampling with high temperature

A high temperature (temperature=14.0) flattens the distribution, which can produce surprising or chaotic output.

outputs = model.generate(
    tokenizer("Shakespeare was a", return_tensors="pt")["input_ids"],
    pad_token_id = tokenizer.eos_token_id,
    max_new_tokens=50,
    do_sample=True,
    temperature=14.0,
    top_p=None)
tokenizer.decode(outputs[0])
"Shakespeare was a product of what historians usually refer as medieval 'ruled in.' He had little practical and administrative knowledge or political leadership: so what had him to say concerning 'The State House,' but that it has neither wealth. That such great, old-"

Defining a Helper Function for Generation

We create a small helper function generate_text to simplify text generation; by default it decodes greedily (do_sample=False).

def generate_text(prompt, max_new_tokens=20):
    outputs = model.generate(
        tokenizer(prompt, return_tensors="pt")["input_ids"],
        pad_token_id = tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        temperature=None,
        top_p=None)
    return tokenizer.decode(outputs[0])    

generate_text("Shakespeare was a")
'Shakespeare was a great writer, but he was also a great actor. He was a great actor because he was a'

Zero-Shot Prompting

Let’s try to get the model to tell us about the jobs of famous historical figures.

generate_text("""Tell me what job this person had.

Shakespeare:""")
'Tell me what job this person had.\n\nShakespeare: I have a friend who is a doctor.\n\nMe: What kind of doctor?\n\n'
generate_text("""Tell me what job this person had.

Shakespeare:""", max_new_tokens=1)
'Tell me what job this person had.\n\nShakespeare: I'
generate_text("""Shakespeare's job was""", max_new_tokens=10)
"Shakespeare's job was to write plays, and he did that very well"
generate_text("""Shakespeare was a""", max_new_tokens=1)
'Shakespeare was a great'
generate_text("""Shakespeare's occupation was as a""", max_new_tokens=1)
"Shakespeare's occupation was as a playwright"
generate_text("""Shakespeare's worked as a""", max_new_tokens=1)
"Shakespeare's worked as a playwright"

Recommending Places to Visit

Here we show a conversation-like context and see how the model continues the text.

out = generate_text("""The following text is a conversation between two friends, one from NYC and other from Boston.
              
NYC: Give me a list of places to see in Boston.
Boston:""", max_new_tokens=40)
print(out)
The following text is a conversation between two friends, one from NYC and other from Boston.
                  
NYC: Give me a list of places to see in Boston.
Boston: Boston is a great city. I would recommend the Boston Museum of Fine Arts, the Boston Public Library, the Boston Public Garden, the Boston Public Market, the Boston Public Aquarium, the Boston Public Library
print(generate_text("The best places to visit in Boston are"))
The best places to visit in Boston are the Boston Museum of Science, the Boston Public Library, the Boston Public Garden, the Boston Public Market

Sampling Helper Function

We define a slightly different helper function that sets do_sample=True by default, to illustrate random sampling from the model.

def generate_text_sample(prompt, max_new_tokens=20):
    outputs = model.generate(
        tokenizer(prompt, return_tensors="pt")["input_ids"],
        pad_token_id = tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=None,
        top_p=None)
    return tokenizer.decode(outputs[0])    

for _ in range(10):
    print(generate_text_sample("When you visit Boston, make sure you visit the"))
When you visit Boston, make sure you visit the oldest, historic park in the United States.

The first Native American settlement on the East Coast
When you visit Boston, make sure you visit the African American Museum. It is a phenomenal museum that depicts the African Americans’ history from the colonization of
When you visit Boston, make sure you visit the Old North Cemetery, which is located at the back of the old train station. You can see
When you visit Boston, make sure you visit the Museum of Fine Arts (MFA), which showcases American and European art up close. The MFA
When you visit Boston, make sure you visit the Faneuil Hall market, with its colorful, eclectic collection of antiques.

Fanny
When you visit Boston, make sure you visit the "Shake It Up" theatre. You'll have a good experience with this one and then you
When you visit Boston, make sure you visit the Lincoln Memorial. The most famous place in America, it bears witness to the United States civil war,
When you visit Boston, make sure you visit the Museum of Science and Industry.   It is an amazing place.   It is not a
When you visit Boston, make sure you visit the top of the roundabout in Charlestown. If you go to the southbound roundabout, turn
When you visit Boston, make sure you visit the following top 10 neighborhoods in the Boston region (10+ neighborhoods in any one Boston area

Here, each generation produces a different continuation.

generate_text("""When you visit Boston, make sure you visit the Faneuil Hall market, with its colorful, eclectic collection of antiques.

Fanny""")
"When you visit Boston, make sure you visit the Faneuil Hall market, with its colorful, eclectic collection of antiques.\n\nFanny's Farewell\n\nFanny's Farewell\n\nFanny's Farewell"

Sentiment Analysis: Zero-Shot Doesn’t Work Well

We collect some sample reviews of a game and see how the model classifies or continues them. These are not refined sentiment analyses, but they illustrate the model’s completion behavior when it is given a particular pattern or instruction.

REVIEWS = [
    """1)I know Blizzard gets a lot of hate, but personally, I don't think it gets enough. 2)During my childhood Blizzard couldn’t get it wrong. As an adult, they can’t get it right""",
    """Either you work as a laborer or you farm in Diablo 4. They're both the same.""",
    """If you had fun with previous Diablo titles, you'll enjoy this one. It's nothing new or groundbreaking, more or less the same as Diablo III with a graphics upgrade""",
    """I'm not really the target audience here, as I don't stick around with ARPGs for long, but so far it's been quite enjoyable and addicting... and also the one I've played the most.""",
    """I heard a lot of trash talk about D4, and let’s be honest here - a lot of criticism is absolutely justified, but the game is nowhere near as bad as some people paint it to be""",
    """I dont care what everyone says i loved playing the campaign."""
]

Another simple generate_text function

For convenience, we redefine generate_text so that it returns only the newly generated tokens that follow the input prompt.

def generate_text(prompt, max_new_tokens=20):
    inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
    outputs = model.generate(
        inputs,
        pad_token_id = tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        temperature=None,
        top_p=None)
    return tokenizer.decode(outputs[0, inputs.shape[1]:])

Quick Examples of Continuations for Sentiment-Like Prompts

generate_text("Someone said: " + REVIEWS[0] + "\nThis comment is")
" a joke.\n\nI don't think Blizzard gets enough hate. I think they get too"
generate_text("Someone said: " + REVIEWS[0] + "\nThe sentiment of this comment is")
' that Blizzard is a company that has been around for a long time and has been around for a'

Classifying positivity or negativity

We prompt the model to guess whether a review is “positive” or “negative.” It tries to continue the text accordingly, though it may not be accurate or fully aligned with typical sentiment analyzers.

for r in REVIEWS:
    print(generate_text("Someone said: " + r + "\nIf this reviewer had to classify this as positive or negative, they would choose", max_new_tokens=5))
 positive.

I
 positive.

I
 positive. It's a
 positive.

I
 negative.

The
 positive.

I
 good

(Note how the model is somewhat inconsistent.)


Few-Shot Prompting

By providing some labeled examples, we can coax the model to produce classification results more in line with them.

REVIEW_PROMPT = """
Review: tried on gamepass, and freaking love it, might as well get it on steam while its on sale.
Decision: good

Review: Game was released defunct, with Paradox and Colossal lying about the state of the game and the game play aspects.
Decision: bad

Review: Almost seven months after launch this game is still not were it is supposed to.
Decision: bad

Review: It is being improved and with time will become the greatest city builder ever.
Decision: good
"""

for r in REVIEWS:
    print(generate_text(REVIEW_PROMPT + f"\nReview:{r}\nDecision:", max_new_tokens=1))
 good
 bad
 good
 good
 good
 good

Here, after seeing the labeled examples, the model answers with one-word good/bad labels in the expected format, though it still chooses good for most of the reviews.
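An alternative to generating the label is to compare the scores the model assigns to the two label tokens directly. A rough sketch, assuming ' good' and ' bad' each map to a single token for this tokenizer:

# Compare the next-token logits for ' good' vs ' bad' after each review
good_id = tokenizer(" good")["input_ids"][0]   # assumes the label is a single token
bad_id = tokenizer(" bad")["input_ids"][0]
for r in REVIEWS:
    ids = tokenizer(REVIEW_PROMPT + f"\nReview:{r}\nDecision:", return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    print("good" if logits[good_id] > logits[bad_id] else "bad")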