Cluster Workshop

Introduction

The goal of this workshop is to become proficient at using the NCSA Delta cluster. The approach is universal: large-scale machine learning and scientific computing use similar tools and workflows, and Slurm is used almost everywhere.

We will work entirely in the terminal and will not use Jupyter or any GUI tools. This isn’t strictly necessary with Delta, but being an expert with the terminal is essential for working at scale with a cluster. For example, I often find myself working with text files that are too large to open in Visual Studio Code, but which can be opened in the terminal.

You will need to use a terminal-based text editor. If you haven’t used one before, start with nano: it is a straightforward editor where you can press “Control+S” to save and “Control+X” to exit. In the long run, I strongly recommend learning to use Vim, but that’s beyond the scope of this workshop. On many clusters, Vim and Nano are the only editors available, and there are situations where you will not be able to use a GUI editor.

The New Delta

The Delta cluster is in the process of being upgraded to a modern software stack, and we are going to use this new stack, which is considerably easier to work with. However, the new stack requires some specific settings, which we cover below.

Making Authentication Easier

Let’s first start by tackling a common annoyance: having to remember the username and hostname for the cluster. To use Delta with the new software stack, you need to ssh USERNAME@dt-login04.delta.ncsa.illinois.edu, which is hard to remember. However, you can create an SSH alias on your laptop. To do so, create a file called ~/.ssh/config with the following contents:

Host delta
    User [YOUR USERNAME]
    HostName dt-login04.delta.ncsa.illinois.edu

If you create it successfully, you should be able to run ssh delta and log in.

A second annoyance with Delta is that it prompts you to use Duo on every login. To avoid this, we can enable SSH connection multiplexing, which lets us authenticate once and reuse that connection for subsequent logins. To do so, append the following lines to the Host delta block in your ~/.ssh/config file:

    ControlMaster auto
    ControlPath ~/.ssh/master-%r@%h:%p
    ControlPersist yes

ControlMaster auto enables a single “master” connection to be shared by later SSH sessions to the same host, so they reuse the existing authenticated channel. ControlPath defines where the shared control socket is stored, with %r (remote user), %h (host), and %p (port) placeholders keeping sockets distinct. ControlPersist yes keeps the master connection running in the background after the first session ends, so new sessions open without re-authenticating.

After you add these lines, you should log out and log back in to Delta a few times. You will need to use Duo the first time, but subsequent logins will skip the Duo prompt.
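For reference, here is what the combined entry looks like, with the Control* options indented inside the Host delta block:

```
Host delta
    User [YOUR USERNAME]
    HostName dt-login04.delta.ncsa.illinois.edu
    ControlMaster auto
    ControlPath ~/.ssh/master-%r@%h:%p
    ControlPersist yes
```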

Login vs. Compute Nodes

When you first log in to a cluster, you are on one of several login nodes. A login node is a shared machine, and you can run who to see the list of other users on the login node. Clusters limit how much work you can do on a login node in one of several ways. Some clusters throttle how fast your programs can run on the login node. Other clusters, such as Delta, just kill programs that use too much CPU/RAM. At a high level, the limit is designed so that you can run a text editor, download small files, and perhaps run some light scripts on the login node. But, to do real work, you need to schedule a job on a compute node.

To get a sense of how busy the cluster is, you can use the sinfo command, which shows the status of the compute nodes. Here are a few lines of output from running sinfo on Delta:

gpuA40x4                  up 2-00:00:00      1   drng gpub038
gpuA40x4                  up 2-00:00:00     26   resv gpub[004,007,010,013-014,017-018,020,027,031,033,036,045,050,054,059-060,065,067,075,081-083,088,094,096]
gpuA40x4                  up 2-00:00:00     43    mix gpub[005,008-009,011-012,016,019,022,024,028-030,040-041,043-044,047,049,052,056-058,061,063-064,066,068,070-073,077-078,080,084-086,089-090,092,097-098,100]
gpuA40x4                  up 2-00:00:00     28  alloc gpub[003,006,015,021,023,025-026,032,034-035,037,039,042,046,048,051,053,055,062,069,074,076,079,087,091,093,095,099]

This output indicates that there are 98 compute nodes in these states, each with 4 NVIDIA A40 GPUs. Of these nodes:

  • 1 node is being “drained” – something is wrong with it and the sysadmins are taking it offline for maintenance.
  • 26 are “reserved” for some special project. (Northeastern reserves several for the NDIF project.)
  • 43 are in a “mixed” state – people are using some but not all of the GPUs on them, so we can get an A40 immediately.
  • 28 are “alloc” – they are fully in use.

The final state is “idle”, which means a node is completely free. The only node on Delta that is consistently idle is the one with 8 AMD MI100 GPUs:

gpuMI100x8                up 2-00:00:00      1   idle gpud01

Finally, 2-00:00:00 means 2 days: the maximum time limit for any single job in this partition.
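As a quick terminal exercise, you can total the NODES column of output like this with awk. Here a scratch file stands in for piping sinfo directly, with the node lists abbreviated:

```shell
# sinfo_a40.txt holds the four gpuA40x4 lines shown above (node lists
# abbreviated); awk sums the NODES column, the fourth field of each line.
cat > sinfo_a40.txt <<'EOF'
gpuA40x4 up 2-00:00:00  1 drng  gpub038
gpuA40x4 up 2-00:00:00 26 resv  gpub[004,...]
gpuA40x4 up 2-00:00:00 43 mix   gpub[005,...]
gpuA40x4 up 2-00:00:00 28 alloc gpub[003,...]
EOF
awk '{ total += $4 } END { print total }' sinfo_a40.txt
# prints 98
```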

You should run sinfo yourself and see which GPU types are available. When I want a result quickly, I use sinfo output to choose between GPU types. I will typically use an A40 or an A100 on Delta.

Submitting a Job

To get experience with submitting jobs, we will do a typical task: run a hyperparameter sweep with a small LLM. We will try five different learning rates and run each experiment as a separate job. This will allow the cluster to run several jobs concurrently, and we may get results faster. This is the kind of task where the cluster shines. We will cover interactive use later, which is more cumbersome.

Here’s a Python training script that will run on Delta:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import json

def chat_tokenize(tokenizer, messages_list):
    tensor_dict = tokenizer.apply_chat_template(
        messages_list,
        tokenize=True,
        padding="max_length",
        truncation=True,
        max_length=1024,
        return_tensors="pt",
        return_dict=True,
    )
    tensor_dict["labels"] = tensor_dict["input_ids"]
    return tensor_dict

def main():
    MODEL = "Qwen/Qwen3-1.7B"
    batch_size = 2
    grad_acc_steps = 8
    max_steps = 20

    dataset = load_dataset("trl-lib/Capybara", split="train")
    tokenizer = AutoTokenizer.from_pretrained(MODEL, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(MODEL, dtype=torch.bfloat16).to("cuda")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    step = 0
    accum_loss = 0.0
    indices = list(range(len(dataset)))

    assert max_steps * grad_acc_steps * batch_size <= len(dataset), "the dataset is too small for max_steps * grad_acc_steps micro-batches"
    for micro_step in range(max_steps * grad_acc_steps):
        messages_list = [dataset[idx]["messages"] for idx in indices[micro_step*batch_size:(micro_step+1)*batch_size]]
        batch = chat_tokenize(tokenizer, messages_list).to(model.device)
        outputs = model(**batch)
        loss = outputs.loss / grad_acc_steps
        loss.backward()
        accum_loss += loss.item()

        if (micro_step + 1) % grad_acc_steps != 0:
            continue

        step += 1
        optimizer.step()
        optimizer.zero_grad()
        print(json.dumps({ "step": step, "loss": accum_loss }), flush=True)
        accum_loss = 0.0

if __name__ == "__main__":
    main()

You should create a directory on Delta for this task (mkdir workshop) and save the script above to that directory (call it train.py). Don’t try to run the script on the login node. We will instead need to write an sbatch script to submit it to a compute node.

An sbatch script has two parts: a prefix that describes the resources you want from the compute node, and a suffix that is a shell script to run on that node. This is the suffix that we will use, which first activates a Conda environment that has a recent PyTorch and then starts your script:

module load pytorch-conda
conda activate base
python3 train.py

Every line of the prefix starts with #SBATCH. The cluster has several defaults, but I recommend using the lines below:

  • #SBATCH --partition=gpuA40x4-interactive: the job queue to use. You can use another queue if you like. The “interactive” queues on Delta cost more, but have higher priority. Consider using gpuA100x4-interactive if the A40s are too busy.

  • #SBATCH --account=bchk-delta-gpu: This is the account for the class, and the only account you have. It is possible to have several accounts on the cluster for different projects.

  • #SBATCH --gres=gpu:1: the number of GPUs to use. Our script only supports one GPU, so there is no point in asking for more.

  • #SBATCH --ntasks=8: how many CPU threads to use. Our script is GPU heavy and 8 will be enough.

  • #SBATCH --nodes=1: we just need one node. It may not be strictly necessary to write this, but Slurm is capable of spreading tasks across several nodes. We definitely don’t want that here: we want all tasks on exactly one node, which is why I request one node explicitly.

  • #SBATCH --mem=60G: how much memory to reserve on the node. On Delta, due to the way accounting works, there is no point in asking for less and asking for more costs more.

  • #SBATCH --time=0:20:00: A time limit for the job. It may be tempting to set this really high, but if you say your job will take a very long time, the cluster may delay starting it. We are only charged for the actual time that the job runs, and not for the time reserved.

  • #SBATCH --reservation=RH9: use the new software stack on Delta that uses Red Hat Enterprise Linux (RHEL) 9. At some point in the future, we will not need this.

Finally, an sbatch script is a shell script, so the first line must be #!/bin/bash.

You should put all of this together to create an sbatch script (train.sbatch) and submit it with sbatch train.sbatch. You can check the status of your jobs with squeue -u $USER. When your job starts, its output will show up in a file called slurm-<job_id>.out.
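Assembled from the pieces above, train.sbatch looks like this (adjust the partition or time limit as you like; this file is submitted with sbatch, not run directly):

```shell
#!/bin/bash
#SBATCH --partition=gpuA40x4-interactive
#SBATCH --account=bchk-delta-gpu
#SBATCH --gres=gpu:1
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH --mem=60G
#SBATCH --time=0:20:00
#SBATCH --reservation=RH9

module load pytorch-conda
conda activate base
python3 train.py
```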

After your first job starts, you can submit jobs with the learning rates 2e-5, 1e-5, 5e-6, 1e-6, and 5e-7. You can create copies of the train.sbatch file for each job. Alternatively, you can modify the training script to take a command-line argument for the learning rate. The sbatch script can also take a command-line argument. E.g., it is possible to modify the scripts so that you can run this:

sbatch train.sbatch 2e-5
sbatch train.sbatch 1e-5
sbatch train.sbatch 5e-6
sbatch train.sbatch 1e-6
sbatch train.sbatch 5e-7
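One sketch of the argument-passing approach: inside the script, $1 holds the first argument given after the script name on the sbatch command line. This assumes you also modify train.py to read the learning rate from the command line (for example, float(sys.argv[1]) in place of the hard-coded 2e-5; any interface you prefer will do).

```shell
#!/bin/bash
# (same #SBATCH prefix as before)
module load pytorch-conda
conda activate base
# $1 is the argument passed on the command line, e.g. "2e-5" for:
#   sbatch train.sbatch 2e-5
python3 train.py "$1"
```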

Tip: I like to create an alias called myjobs as shown below. The complicated format string adds some useful columns to the output, such as the estimated start time for the job.

alias myjobs="squeue -u $USER --format \"%.18i %.9P %.30j %.10T %.10M %.9l %.6D %.18R %S\""

You can add this to the file ~/.bashrc on Delta to make it permanent.
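Since each job prints one JSON object per optimizer step, the last line of each slurm-<job_id>.out file holds the final loss, which makes comparing the five runs easy. A small sketch (the file names and JSON values below are fabricated stand-ins for real job output):

```shell
# Fabricated stand-ins for two jobs' output files:
printf '{"step": 19, "loss": 2.31}\n{"step": 20, "loss": 2.18}\n' > slurm-1001.out
printf '{"step": 19, "loss": 2.40}\n{"step": 20, "loss": 2.05}\n' > slurm-1002.out

# Print each file name next to its final training step:
for f in slurm-*.out; do
  printf '%s: %s\n' "$f" "$(tail -n 1 "$f")"
done
```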

Using a Slurm Cluster Interactively

We will switch to a different task to learn how to use a Slurm cluster interactively: we will run a model with VLLM and evaluate its performance on a task. To do this, we need two windows open: one with VLLM and another that runs your evaluation script. We will learn how to install packages locally and how to use tmux.

Why Tmux?

The problem that tmux solves is that when you disconnect from the cluster, your current terminal state is lost. This is a real problem for interactive work, since disconnects happen often enough over Wi-Fi.

Tmux is a terminal multiplexer, which allows you to create persistent terminal sessions that survive disconnections. When you run tmux on the login node, you will see a green status bar at the bottom of your terminal. This indicates that you are now inside a tmux session. Try entering a command such as echo hello to verify everything is working.

Now, close your terminal window or disconnect from Delta entirely. When you reconnect to Delta and run tmux attach, you will find yourself back in the exact same state—your command history, working directory, and any running processes will all be preserved. This is the core value of tmux: it allows you to maintain long-running interactive sessions even when your network connection is unstable. Delta will allow you to run a tmux session for up to 2 weeks.

Allocating a Compute Node for Interactive Use

Now that you have a tmux session running, you can allocate a compute node for interactive use. Use the following command:

salloc --reservation=RH9 --partition=gpuA40x4 --account=bchk-delta-gpu --nodes=1 --ntasks=8 --mem=60G --gres=gpu:1 --time=1:00:00

This command does not run anything on the node; it just reserves one for you. When the allocation is ready, Slurm prints the node name in your terminal, and you can then ssh to it with ssh <node_name>. Notice that you are doubly SSH’d: from your laptop to the login node, and from the login node to the compute node.

Installing Packages Locally

To use PyTorch on the compute node, you need to activate the Conda environment as we did earlier:

module load pytorch-conda
conda activate base

Unfortunately, this Conda environment does not have VLLM installed. We will create a Python virtual environment that can access the system packages (like PyTorch from the Conda environment) while allowing us to install additional packages like VLLM locally. This approach lets us use the pre-installed PyTorch without needing to reinstall it, which would be slow and waste disk space.

Create the virtual environment with the --system-site-packages flag, which allows it to access packages from the Conda environment:

python3 -m venv --system-site-packages myenv
source myenv/bin/activate

After running source myenv/bin/activate, your prompt should show (myenv) (base), indicating that both the virtual environment and the Conda base environment are active. Now install VLLM:

pip install "vllm==0.11.0"

Now you can run VLLM to serve the model:

vllm serve Qwen/Qwen3-1.7B

Tmux Windows

VLLM takes a few minutes to start. Watch for the first line, which must read “Automatically detected platform cuda.” When VLLM is ready, you will see “INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)”. VLLM runs continuously until you quit it, so you need another terminal window to query the model.

You can create a new “window” in tmux by pressing “Control+b” followed by “c”. You will see a new window number appear in the status bar at the bottom. To switch between windows, press “Control+b” followed by the window number (e.g., “Control+b” then “0” for the first window, “Control+b” then “1” for the second). Try switching between the VLLM window and the new window to get comfortable with this workflow.

Note that since you started tmux on the login node, the new window will also be on the login node, not the compute node. You will need to SSH to the compute node again with ssh <node_name>. If you’ve forgotten the node name, you can find it by running squeue -u $USER.

Once you’re back on the compute node in the new window, you can query the model using curl. For example:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen3-1.7B",
  "messages": [
    {"role": "user", "content": "Give me a poem about northeastern university <no_think>"}
  ],
  "max_tokens": 512
}'
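The server replies with OpenAI-style JSON, and the assistant’s text lives at choices[0].message.content. A quick way to pull it out with python3, shown here on an abbreviated stand-in response (the real one comes from the curl command above):

```shell
# Stand-in for the JSON the server returns (heavily abbreviated):
response='{"choices": [{"message": {"role": "assistant", "content": "A poem about Northeastern..."}}]}'
# Extract just the assistant's reply text:
echo "$response" | python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
# prints: A poem about Northeastern...
```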