Thomas the Travel Agent
Introduction
In this project, you will develop the natural language interface to a flight reservation system that we will call Thomas the Travel Agent.[1] You will add support for a rich set of queries and for booking flights (faked, obviously), and you will try to characterize the capabilities and limitations of your system.
The Chat Completions API
The LLM that we will use in this assignment is Qwen 3 4B Instruct 2507. You will not be loading the model directly, but will use a model that we are hosting on a server. You can access the model using LiteLLM as follows:
# /// script
# requires-python = "==3.12.*"
# dependencies = [
# "litellm>=1.77, <2",
# ]
# ///
import os
import litellm
os.environ["OPENAI_API_KEY"] = "dummy"
os.environ["OPENAI_BASE_URL"] = "http://10.200.206.231:8000/v1"
resp = litellm.completion(
    model="openai/qwen3",
    messages=[
        {
            "role": "user",
            "content": "Write short complaint to The Boston Globe about the rat problem at Northeastern CS. Blame the math department. No more than 4 sentences.",
        }
    ],
    temperature=0,
)
print(resp.choices[0].message.content)
When you run this code, you may see something like this:
The Boston Globe,
The math department at Northeastern is clearly responsible for the rampant rat infestation on campus—why else would the only building with a full complement of calculus textbooks also be the epicenter of this pest problem? We’ve seen students’ lab notes replaced with droppings and equations, and the dean still hasn’t addressed it. This is not a sanitation issue—it’s a mathematical oversight.
—A Concerned Student
Two important notes:
- Be sure to set a temperature. If you don’t, the model server picks a default value (probably around 0.7), which works quite poorly with code.
- The IP address above will only work on the Northeastern campus network or over the Northeastern VPN.
The Dataset and Tools
We have prepared an imaginary flight dataset that you can load as follows:
from datasets import load_dataset
flights = load_dataset("nuprl/engineering-llm-systems", name="flights", split="train")
Your travel agent will have access to two tools:
- A tool to search for flights
- A tool to book a flight
The booking tool is straightforward: it should take a flight’s ID and book it, if it has seats available. The search tool can be arbitrarily complex, but a minimum requirement is that it can search for flights between two cities on a given date. We recommend giving tools the following signatures:
import datetime
from typing import List, Optional

# Flight stands for whatever record type you use to represent one flight.
def find_flights(origin: str, destination: str, date: datetime.date) -> List[Flight]:
    ...

def book_flight(flight_id: int) -> Optional[int]:
    ...
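As a rough illustration of how these two signatures might behave, here is a minimal sketch that works over an in-memory list instead of the real dataset. The Flight dataclass and its field names (id, origin, destination, date, available_seats), the FLIGHTS sample rows, and the choice to return the booked flight's ID (or None on failure) are all assumptions made for this example. Check the actual dataset columns before relying on any of them.

```python
import datetime
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Flight:
    # Hypothetical fields; the real dataset's columns may differ.
    id: int
    origin: str
    destination: str
    date: datetime.date
    available_seats: int


# Made-up sample rows standing in for the real dataset.
FLIGHTS = [
    Flight(1, "BOS", "DFW", datetime.date(2023, 1, 1), 0),
    Flight(2, "BOS", "DFW", datetime.date(2023, 1, 1), 185),
    Flight(3, "BOS", "LAX", datetime.date(2023, 1, 2), 12),
]


def find_flights(origin: str, destination: str, date: datetime.date) -> List[Flight]:
    """Return every flight matching the route and date, including full ones."""
    return [
        f for f in FLIGHTS
        if f.origin == origin and f.destination == destination and f.date == date
    ]


def book_flight(flight_id: int) -> Optional[int]:
    """Book one seat and return the flight ID, or None if booking fails."""
    for f in FLIGHTS:
        if f.id == flight_id and f.available_seats > 0:
            f.available_seats -= 1
            return f.id
    return None
```

Returning full flights from the search, rather than filtering them out, lets the agent tell the customer that a flight exists but is fully booked.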
The Agent
You should not use Generative AI for this part of the assignment.
Write an agent that is a command-line Python program called thomas.py that
we will launch as follows:
uv run thomas.py
The customer should be able to type messages directly from the command line,
pressing enter to send each message. The agent should respond appropriately to
every message that it receives, and execute actions as needed. A blank message
should signal the termination of the conversation and the program. Upon
termination, the final output should be the list of flights that were booked
(displaying an empty list [] if no flights were booked). We will use this
final output to scaffold testing.
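The input/output contract described above can be sketched as a small loop. This is only an illustration of the protocol, not a reference solution: run_chat, the respond callback, and the echo placeholder are hypothetical names, and a real agent would call the model and its tools inside respond.

```python
from typing import Callable, List


def run_chat(
    respond: Callable[[str, List[int]], str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> List[int]:
    """Read messages until a blank one, then print the booked flight IDs."""
    booked: List[int] = []  # respond() appends an ID here whenever it books
    while True:
        message = read("User (blank to quit): ")
        if message.strip() == "":
            break
        write("Agent: " + respond(message, booked))
    write(str(booked))  # the final line of output, e.g. [] or [2]
    return booked


def echo(message: str, booked: List[int]) -> str:
    # Placeholder responder; it never books anything.
    return "You said: " + message
```

Calling run_chat(echo) starts an interactive session on the terminal. Passing read and write as parameters is a design choice that makes the loop easy to drive from a test without touching standard input.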
Example: Here is an example interaction with an agent that we wrote:
% uv run thomas.py
User (blank to quit): what are teh flights available from bos to dallas on jan 1
Agent: Unfortunately, there are no flights available from BOS to DAL on January 1.
User (blank to quit): i mean dallas forth worth airport?
Agent: Flight WN5593 departs at 02:57 and is fully booked. Flight B67515 departs at 15:21 and has 185 available seats.
User (blank to quit): ok let's do the second
Agent: I booked you a seat on Flight B67515 departing at 15:21 and arriving at 17:21.
User (blank to quit): thanks
Agent: You're welcome!
User (blank to quit):
[2]
Notice that the final line is the list of flights that were booked.
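Because the final line is a plain Python-style list, a test harness could recover the booked flights from the program's output along these lines. This is a hypothetical sketch, not the course's scaffold: parse_booked is an invented helper, and it assumes that the list sits alone on the final line and holds integer flight IDs, as in the transcript above.

```python
import ast
from typing import List


def parse_booked(stdout: str) -> List[int]:
    """Parse the last non-empty line of the output as a list of flight IDs."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no output to parse")
    booked = ast.literal_eval(lines[-1].strip())
    if not isinstance(booked, list):
        raise ValueError("final line is not a list")
    return booked


transcript = "Agent: You're welcome!\nUser (blank to quit):\n[2]\n"
print(parse_booked(transcript))
```

Any extra text printed after the list, such as a farewell message or a debug log, would break this kind of parsing, which is why the list should be the very last thing the program prints.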
Testing Scaffold
Testing a multi-turn interaction is quite hard.[2] We will evaluate your agent with a simple testing scaffold, run_tests.py, which you can use as well. The script reads tests from a YAML file and runs several tests concurrently. It also runs multiple tries (default 10) for each test. Read the code for more information.
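To give a feel for what repeated, concurrent tries involve, here is a small sketch that runs one test several times in a thread pool and reports the fraction of passing tries. It is not the code in run_tests.py: pass_rate, run_once, and the comparison against an expected list are assumptions made for illustration, and the stand-in agent below is deterministic only so that the example is reproducible.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def pass_rate(
    run_once: Callable[[int], List[int]],
    expected: List[int],
    tries: int = 10,
    max_workers: int = 4,
) -> float:
    """Run run_once(try_index) concurrently and return the passing fraction."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_once, range(tries)))
    passed = sum(1 for booked in results if booked == expected)
    return passed / tries


def flaky_agent(try_index: int) -> List[int]:
    # Stand-in for launching the agent; it fails on every fifth try.
    return [] if try_index % 5 == 0 else [2]


print(pass_rate(flaky_agent, expected=[2]))
```

Reporting a pass rate over several tries, rather than a single pass or fail, is a common way to account for model outputs that vary from run to run.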
Develop a Benchmark for Thomas
You may use Generative AI for this part of the assignment. Note, however, that LLMs tend to produce very “average” results. You will need to edit their responses to make them more interesting, or be creative with how you prompt the model.
You should develop a benchmark with at least 10 problems that test the strengths and limitations of your agent. You should also include both successful and unsuccessful interactions.
[1] Thomas Cook was a travel agency that invented the cookie-cutter vacation.
[2] See 𝜏-bench for a benchmark of multi-turn LLM interactions.