# Inference on GPUs

## Instructions for Delta

You do not need to install packages or download the model yourself. We have prepared a Python virtual environment that has the necessary packages. However, you need to make Jupyter aware of the environment.

To do so, from a terminal on Delta:

```bash
source /scratch/bchk/aguha/venv/bin/activate
python -m ipykernel install --user --name=bchk
```

If you're using a terminal within Jupyter, the new kernel may take a few minutes to appear.

When you load this notebook in Delta, ensure that *bchk* appears in the top-right corner.



## Introduction

The goal of this tutorial is to show you the basics of inference on a single
GPU. The Transformers library has a lot of convenient methods for training and
generation. We are going to avoid them and instead work directly with the neural
network. This is the only way to understand the material.

To use this notebook, you will need:

1. A Python environment with PyTorch installed. I am not going to tell you how
   to set this up, because it varies considerably across machines.
2. A GPU with 16GB+ VRAM.
3. The following extra packages, which you can install with pip:
   ```
   pip3 install transformers datasets matplotlib tqdm flash_attn accelerate
   ```

The `flash_attn` package isn't strictly necessary. But without it, the model
will run more slowly and you will need more GPU memory than indicated above.
If you have trouble installing it, visit the 
[Flash Attention 2](https://github.com/Dao-AILab/flash-attention) page.

We first load a couple of modules.

In [2]:
# We need to use PyTorch directly, since we are not using the highest-level APIs.
import torch
# Some simple PyTorch functions.
from torch.nn import functional as F
# Transformers has implementations for essentially every well-known LLM. We
# are going to work with what are called causal, auto-regressive, or
# decoder-only models. These include StarCoder, Llama, the GPT models, etc.
from transformers import AutoModelForCausalLM, AutoTokenizer

The following cell loads the model and tokenizer. You may need to modify
the `MODEL` variable below to load the model from a different path.

In [None]:
MODEL = "/scratch/bchk/aguha/models/llama3p1_8b_base"

# Reasonable options for DEVICE:
# - "mps" for Apple Silicon
# - "cuda" for Nvidia GPUs
# - "cpu" for CPUs
DEVICE = "cuda"

# The model and tokenizer get loaded separately. But, they are typically at the
# same location, and it never makes sense to mix-and-match tokenizers and
# models.
tokenizer = AutoTokenizer.from_pretrained(MODEL, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token 

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,
).to(device=DEVICE)

## Tokenizer Basics

The tokenizer turns an input string into a sequence of tokens, which are the
`input_ids` that appear below. We can ignore the `attention_mask` for now.
We write `return_tensors="pt"` to get a PyTorch tensor. Without this flag, we
get the output as a Python list, which the model cannot use.

In [None]:
example_inputs = tokenizer(["def hello():\n\treturn"], return_tensors="pt")
example_inputs

It's worth understanding the shape of the tensor:

In [None]:
example_inputs.input_ids.shape

It's a 2D tensor, where the first dimension is the batch and the second
dimension is the sequence length. We have a single item in the batch, so the
length of the first dimension is 1. The second dimension is 6, so the input was
split into 6 tokens.

The code below loops through the tokens and *decodes* them back into strings.

In [None]:
# We have a single item in the batch, so just take the first row.
example_input = example_inputs.input_ids[0]
# For every token in the sequence
for token in example_input:
    # Without token.item() we print tensor(n) instead of n.
    # We use __repr__ so that special characters like newlines are printed.
    print(token.item(), "->", tokenizer.decode(token).__repr__())
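To make the mapping between strings and token IDs concrete, here is a minimal sketch of a toy tokenizer in plain Python. Everything in it is hypothetical: the vocabulary `TOY_VOCAB` is made up, and the greedy longest-match rule is only a stand-in for the learned algorithm that the real tokenizer uses. The point is just that encoding maps text to a list of integers, and decoding maps them back.

```python
# Hypothetical toy vocabulary. A real tokenizer learns its vocabulary from data
# and has far more entries.
TOY_VOCAB = ["def", " hello", "():", "\n", "\t", "return", " "]
TOY_IDS = {piece: i for i, piece in enumerate(TOY_VOCAB)}


def toy_encode(text):
    """Greedy longest-match encoding: a stand-in for a real tokenizer."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (p for p in TOY_VOCAB if text.startswith(p, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token for {text[pos]!r}")
        ids.append(TOY_IDS[match])
        pos += len(match)
    return ids


def toy_decode(ids):
    """Map each ID back to its string and concatenate."""
    return "".join(TOY_VOCAB[i] for i in ids)


toy_ids = toy_encode("def hello():\n\treturn")
for i in toy_ids:
    # repr makes the newline and tab visible, as in the loop above.
    print(i, "->", repr(TOY_VOCAB[i]))
```

Decoding the IDs reproduces the original string exactly, which is the property the real tokenizer's `decode` relies on as well.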

**Do now:** You should try to tokenize a different input string, and run the
decoding loop to see how it gets tokenized. I suggest using your name in the
function name, e.g., `hello_arjunguha`.

## Model Basics

In this section we will directly use the model to get a sense of what it
does.

We'll use the same example input from the previous section. The model requires
the input tensors to be on the same device as the model. The `.to(model.device)`
at the end of the next line takes care of that.

In [None]:
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
print(example_inputs)

The code below runs a *forward pass* with two simplifications:

1. We disable dropout (`model.eval`)
2. We don't compute gradients (`torch.no_grad`)

We'll enable both when we get to training.

Notice that the output from the tokenizer is conveniently structured so that we
can pass it as keyword arguments to `model.forward` by writing
`model.forward(**example_inputs)`. In our case, this is equivalent to:

```python
model.forward(
    input_ids=example_inputs.input_ids,
    attention_mask=example_inputs.attention_mask
)
```
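The `**` syntax is ordinary Python keyword-argument unpacking, and it works on any dict-like object. Here is a small sketch with a plain dictionary and a hypothetical `forward` function that stands in for `model.forward`. The token IDs are made up.

```python
def forward(input_ids=None, attention_mask=None):
    # Hypothetical stand-in for model.forward: just report what it received.
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# A plain dict standing in for the tokenizer output. The IDs are made up.
fake_inputs = {"input_ids": [[10, 20, 30]], "attention_mask": [[1, 1, 1]]}

# Unpacking the dict passes each key as a keyword argument...
unpacked = forward(**fake_inputs)

# ...which is the same as spelling the arguments out by hand.
explicit = forward(
    input_ids=fake_inputs["input_ids"],
    attention_mask=fake_inputs["attention_mask"],
)

print(unpacked == explicit)
```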

In [None]:
model.eval()
with torch.no_grad():
    example_outputs = model.forward(**example_inputs)
print(example_outputs)

The result above has a lot of optional fields, but the one we care about is `logits`.
Let's compare its shape to the shape of the input:

In [None]:
print(example_inputs.input_ids.shape)
print(example_outputs.logits.shape)

We have one output per input token, and each output is a tensor with one
element per entry in the model's vocabulary (roughly 128,000 for Llama 3.1).
Let's look at one of them.

In [None]:
example_outputs.logits[0, 1]

Each index in this tensor corresponds to a token ID, and the
output represents the distribution over all possible tokens. But, look at the
numbers. There are plenty of negative numbers, so this is *not* a probability
distribution.

These are raw, unnormalized predictions, or scores, or *logits*. We can turn
them into a distribution using the *softmax* function. We can also sum the
resulting probabilities to verify that they add up to 1 (up to rounding
error):

In [None]:
example_dist_single = F.softmax(example_outputs.logits[0, 0], dim=0)
print(example_dist_single)
print(example_dist_single.sum())
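To see what softmax does, here is a from-scratch sketch in plain Python on four made-up logits. It is only an illustration of the formula, not how PyTorch implements `F.softmax`: exponentiate each score, then divide by the sum so that the results are positive and add up to 1.

```python
import math


def softmax(logits):
    # Subtracting the max does not change the result, but it keeps exp()
    # from overflowing on large scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits. Note the negative values, as in the model output above.
toy_logits = [2.0, -1.0, 0.5, -3.0]
toy_probs = softmax(toy_logits)

print(toy_probs)
print(sum(toy_probs))
```

The largest logit gets the largest probability, and the ordering of the scores is preserved, which is why taking the argmax of the logits and of the probabilities picks the same token.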

Let's turn every output into a distribution:

In [None]:
# Copied from above
example_inputs = tokenizer(['def hello():\n\treturn'], return_tensors="pt").to(model.device)
model.eval()
with torch.no_grad():
    example_outputs = model.forward(**example_inputs)

# Notice that we do .logits[0] and not .logits[0,0] as above.
example_dist = F.softmax(example_outputs.logits[0], dim=1)
print(example_dist.shape)

Given these distributions (one for each output), we can produce the most likely
output token at each position:

In [None]:
example_dist

In [None]:
print("Input tokens:", example_inputs.input_ids)
for tok in example_inputs.input_ids.cpu().tolist()[0]:
    print(tok, "->", tokenizer.decode(tok).__repr__())
output_tokens = torch.argmax(example_dist, dim=1)
print("Output tokens:", output_tokens)
for tok in output_tokens.cpu().tolist():
    print(tok, "->", tokenizer.decode(tok).__repr__())

Read this as follows:

1. ` get` is the most likely next token after `def`
2. `(` is the most likely next token after `def hello`

...

6. ` "` is the most likely next token after `def hello():\n\treturn`

When doing generation, we only care about (6).
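The reading above can be sketched on a tiny made-up example in plain Python. The vocabulary and the probabilities below are hypothetical; they only mirror the shape of `example_dist`, with one row per input position and one column per vocabulary entry.

```python
# Hypothetical 4-token vocabulary and made-up distributions for 3 positions.
toy_vocab = ["def", " hello", "(", " get"]
toy_dist = [
    [0.1, 0.2, 0.1, 0.6],  # prediction after input token 0
    [0.1, 0.1, 0.7, 0.1],  # prediction after input tokens 0..1
    [0.5, 0.3, 0.1, 0.1],  # prediction after input tokens 0..2
]


def argmax(row):
    # Index of the largest value, like torch.argmax on a 1D tensor.
    return max(range(len(row)), key=row.__getitem__)


# One most-likely token per position, like torch.argmax(example_dist, dim=1).
per_position = [argmax(row) for row in toy_dist]

# For generation, only the last row matters: it predicts the token that
# follows the whole prompt.
next_token = argmax(toy_dist[-1])

print(per_position)
print(next_token, "->", repr(toy_vocab[next_token]))
```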


**Exercise:** Write a generate function that takes a prompt and generates
several tokens of output, stopping when it encounters the end-of-sequence
token, or when a given number of tokens have been generated.

In [14]:
def generate(prompt, max_tokens):
    pass

**Exercise:** Write a generate function that takes a batch of inputs.

In [15]:
def generate_batched(prompts, max_tokens):
    pass
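A hint for the batched version: prompts in a batch usually have different lengths, so the shorter ones must be padded. The tokenizer above was created with `padding_side="left"`, and passing `padding=True` when you call it should pad the batch for you. The sketch below shows the idea in plain Python with a hypothetical `PAD_ID` and made-up token IDs. With left padding, the last column of every row is a real token, so the last position's logits are the next-token prediction for every prompt.

```python
PAD_ID = 0  # hypothetical pad ID; the notebook reuses the EOS token as padding


def left_pad(batch, pad_id=PAD_ID):
    """Pad every sequence on the left to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (width - len(seq)) + seq for seq in batch]
    # The mask is 0 over padding and 1 over real tokens.
    attention_mask = [[0] * (width - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask


# Two made-up prompts of different lengths.
ids, mask = left_pad([[5, 6, 7], [8, 9]])
print(ids)
print(mask)
```

This is also where the `attention_mask` stops being ignorable: it tells the model which positions are padding.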