Thomas the Travel Agent

Introduction

In this project, you will develop the natural language interface to a flight reservation system that we will call Thomas the Travel Agent¹. You will add support for a rich set of queries, implement flight booking (faked, obviously), and try to characterize the capabilities and limitations of your system.

The Chat Completions API

The LLM that we will use in this assignment is Meta Llama 3.1 Instruct (8B), which you will access through the OpenAI Chat Completions API. You can read more about this API in the Chat Completions Guide. The following snippet will help you get started:

from openai import OpenAI

# BASE_URL and API_KEY are placeholders: set them to the values for your model server.
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": "Write short complaint to The Boston Globe about the rat problem at Northeastern CS. Blame the math department. No more than 4 sentences."
    }],
    model="llama3p1-8b-instruct",
    temperature=0,
)
# Print the text of the first (and only) completion.
print(resp.choices[0].message.content)

When you run this code, you may see something like this:

To the Editor,

I am writing to express my concern about the persistent rat problem at Northeastern University’s College of Science. The infestation has become a serious issue, and I believe it is largely due to the unsanitary conditions in the math department’s facilities. The lack of proper waste management and cleanliness in these areas has created an environment that is conducive to rodent activity. I urge the university to take immediate action to address this issue and ensure a safe and healthy environment for students and faculty.

Warning: Be sure to set a temperature. If you don’t, the model server picks a default value (0.7, I think), which works quite poorly with code.

The Dataset and Tools

We have prepared an imaginary flight dataset that you can load as follows:

from datasets import load_dataset

flights = load_dataset("nuprl/engineering-llm-systems", name="flights", split="train")

Your travel agent will have access to two tools:

A tool to search for flights
A tool to book a flight

The booking tool is straightforward: it should take a flight’s ID and book the flight if it has seats available. The search tool can be arbitrarily complex, but at a minimum it must be able to search for flights between two cities on a given date. We recommend giving the tools the following signatures:

import datetime
from typing import List, Optional

def find_flights(origin: str, destination: str, date: datetime.date) -> List[Flight]:
    ...

def book_flight(flight_id: int) -> Optional[int]:
    ...
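As a minimal sketch of how these two tools might behave, the following code keeps the flights in memory and tracks bookings. The Flight fields shown here are hypothetical (the real dataset's columns may differ), the tools are written as methods of a hypothetical FlightTools class rather than as free functions, and the choice to return the flight ID on success and None on failure is an assumption about what Optional[int] is meant to convey.

```python
import datetime
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Flight:
    # Hypothetical fields for illustration; check the dataset for the real columns.
    id: int
    origin: str
    destination: str
    date: datetime.date
    available_seats: int


class FlightTools:
    """Hypothetical container for the two tools and the list of bookings."""

    def __init__(self, flights: List[Flight]):
        # Index flights by ID so that booking is a dictionary lookup.
        self.flights: Dict[int, Flight] = {f.id: f for f in flights}
        self.booked: List[int] = []

    def find_flights(
        self, origin: str, destination: str, date: datetime.date
    ) -> List[Flight]:
        # Exact match on origin, destination, and date.
        return [
            f
            for f in self.flights.values()
            if f.origin == origin and f.destination == destination and f.date == date
        ]

    def book_flight(self, flight_id: int) -> Optional[int]:
        # Assumed convention: return the flight ID on success, None otherwise.
        flight = self.flights.get(flight_id)
        if flight is None or flight.available_seats <= 0:
            return None
        flight.available_seats -= 1
        self.booked.append(flight_id)
        return flight_id
```

Keeping the booked IDs in one place makes it easy to print the final list when the conversation ends.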

The Agent

You should not use Generative AI for this part of the assignment.

Write an agent as a command-line Python program called thomas.py that takes a single argument: the name of the model to use. We can then launch it as follows:

python3 thomas.py llama3p1-8b-instruct

The customer should be able to type messages directly from the command line, pressing enter to send each message. The agent should respond appropriately to every message that it receives, and execute actions as needed. A blank message should signal the termination of the conversation and the program. Upon termination, the final output should be the list of flights that were booked (displaying an empty list [] if no flights were booked). We will use this final output to scaffold testing.

Example: Here is an example interaction with an agent that we wrote:

% python3 thomas.py llama3p1-8b-instruct
User (blank to quit): what are teh flights available from bos to dallas on jan 1
Agent: Unfortunately, there are no flights available from BOS to DAL on January 1.
User (blank to quit): i mean dallas forth worth airport?
Agent: Flight WN5593 departs at 02:57 and is fully booked. Flight B67515 departs at 15:21 and has 185 available seats.
User (blank to quit): ok let's do the second
Agent: I booked you a seat on Flight B67515 departing at 15:21 and arriving at 17:21.
User (blank to quit): thanks
Agent: You're welcome!
User (blank to quit): 
[2]

Notice that the final line is the list of flights that were booked.
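The input/output protocol described above can be sketched as a simple loop. This is only an outline under stated assumptions: reply is a hypothetical callback that stands in for the model call and any tool execution, booked is a list that your tools are assumed to append to, and read_line and write_line are hypothetical hooks that default to input and print so the loop can be exercised without a terminal.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_conversation(
    reply: Callable[[List[Message]], str],
    booked: List[int],
    read_line: Callable[[str], str] = input,
    write_line: Callable[[str], None] = print,
) -> List[int]:
    """Read user messages until a blank line, then print the booked flights."""
    messages: List[Message] = []
    while True:
        try:
            user_text = read_line("User (blank to quit): ")
        except EOFError:
            # Treat end of input like a blank message.
            user_text = ""
        if user_text.strip() == "":
            break
        # Keep the whole history so that the model sees earlier turns.
        messages.append({"role": "user", "content": user_text})
        answer = reply(messages)
        messages.append({"role": "assistant", "content": answer})
        write_line(f"Agent: {answer}")
    # The final output is the list of booked flight IDs, e.g. [] or [2].
    write_line(str(booked))
    return booked
```

Passing the full message history to reply on every turn is what lets a follow-up such as "ok let's do the second" refer back to an earlier answer.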

Testing Scaffold

Testing a multi-turn interaction is quite hard². We will use a simple testing scaffold, run_tests.py, to evaluate your agent, and you can use it as well. The script reads tests from a YAML file and runs several tests concurrently. It also runs multiple tries (10 by default) for each test. Read the code for more information.
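To illustrate why the final list of booked flights is convenient for testing, here is a rough sketch of checking that list over repeated tries. It is not the logic of run_tests.py: parse_booked, pass_rate, and the run_once callback are hypothetical names, and the sketch assumes that the agent's output ends with a flat list such as [] or [2], possibly on the same line as the last prompt.

```python
import ast
import re
from typing import Callable, List


def parse_booked(output: str) -> List[int]:
    """Extract the trailing flat list (e.g. "[2]") from the agent's output."""
    match = re.search(r"\[[^\[\]]*\]\s*\Z", output)
    if match is None:
        raise ValueError("output does not end with a list")
    value = ast.literal_eval(match.group(0).strip())
    if not isinstance(value, list):
        raise ValueError("final output is not a list")
    return value


def pass_rate(run_once: Callable[[], str], expected: List[int], tries: int = 10) -> float:
    """Run the agent several times and return the fraction of matching runs."""
    passed = 0
    for _ in range(tries):
        try:
            if parse_booked(run_once()) == expected:
                passed += 1
        except (ValueError, SyntaxError):
            # Malformed final output counts as a failed try.
            pass
    return passed / tries
```

Reporting a pass rate over several tries, rather than a single pass or fail, reflects the fact that the same conversation may not produce the same bookings on every run.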

Develop a Benchmark for Thomas

You may use Generative AI for this part of the assignment. Note, however, that LLMs tend to produce very “average” results. You will need to edit their responses to make them more interesting, or be creative with how you prompt the model.

You should develop a benchmark with at least 10 problems that test the strengths and limitations of your agent. You should also include both successful and unsuccessful interactions.

¹ Thomas Cook was a travel agency that invented the cookie-cutter vacation.
² See 𝜏-bench for a benchmark of multi-turn LLM interactions.