Thomas the Travel Agent

Introduction

In this project, you will develop the natural language interface to a flight reservation system that we will call Thomas the Travel Agent1. You will add support for a rich set of queries and for booking flights (faked, obviously), and you will try to characterize the capabilities and limitations of your system.

The Chat Completions API

The LLM that we will use in this assignment is Meta Llama 3.1 Instruct (8B), which you will access through the OpenAI Chat Completions API. You can read more about this API in the Chat Completions Guide. This snippet of code will help you get started:

from openai import OpenAI

# BASE_URL and API_KEY must already be defined (the endpoint and key for the model server).
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

resp = client.chat.completions.create(
    messages = [{ 
        "role": "user", 
        "content": "Write short complaint to The Boston Globe about the rat problem at Northeastern CS. Blame the math department. No more than 4 sentences." 
    }],
    model = "meta-llama/Meta-Llama-3.1-8B-Instruct",
    temperature=0)
print(resp.choices[0].message.content)

When you run this code, you may see something like this:

To the Editor,

I am writing to express my concern about the persistent rat problem at Northeastern University’s College of Science. The infestation has become a serious issue, and I believe it is largely due to the unsanitary conditions in the math department’s facilities. The lack of proper waste management and cleanliness in these areas has created an environment that is conducive to rodent activity. I urge the university to take immediate action to address this issue and ensure a safe and healthy environment for students and faculty.

The Dataset

We have prepared an imaginary flight dataset that you can load as follows. (You should be able to use this code verbatim.)

from typing import List
from datasets import load_dataset
from datetime import date, time, datetime
import dataclasses


@dataclasses.dataclass
class Flight:
    id: int
    date: date
    airline: str
    flight_number: str
    origin: str
    destination: str
    departure_time: time
    arrival_time: time
    available_seats: int


def parse_flight(flight):
    return Flight(
        id=flight["id"],
        date=datetime.strptime(flight["date"], "%Y-%m-%d").date(),
        airline=flight["airline"],
        flight_number=flight["flight_number"],
        origin=flight["origin"],
        destination=flight["destination"],
        departure_time=datetime.strptime(flight["departure_time"], "%H:%M").time(),
        arrival_time=datetime.strptime(flight["arrival_time"], "%H:%M").time(),
        available_seats=flight["available_seats"],
    )


def load_flights_dataset() -> List[Flight]:
    return [
        parse_flight(flight)
        for flight in load_dataset("nuprl/llm-systems-flights", split="train")
    ]
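
For example, once the dataset has downloaded, you can load and inspect it like this:

flights = load_flights_dataset()
print(f"Loaded {len(flights)} flights")
print(flights[0])  # a Flight dataclass with parsed date and time fields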

The Tools

Your travel agent will have access to two tools:

  1. A tool to search for flights
  2. A tool to book a flight

The booking tool is straightforward: it should take a flight’s ID and book it, if it has seats available. The search tool can be arbitrarily complex, but a minimum requirement is that it can search for flights between two cities on a given date.
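
To make the requirements concrete, here is a rough sketch of the two tools' core logic over the List[Flight] database. The signatures are illustrative only; the recipe below wraps the tools as methods of an Agent class, and you may want fuzzier matching on cities and dates.

from typing import List, Optional


def find_flights(flights: List[Flight], origin: str, destination: str, on_date: date) -> List[Flight]:
    # Minimum requirement: exact match on origin, destination, and date.
    return [
        f for f in flights
        if f.origin == origin and f.destination == destination and f.date == on_date
    ]


def book_flight(flights: List[Flight], flight_id: int) -> Optional[int]:
    # Book the flight if it exists and has a seat left; return its ID, or None otherwise.
    for f in flights:
        if f.id == flight_id and f.available_seats > 0:
            f.available_seats -= 1
            return f.id
    return None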

Benchmarking the Agent

Benchmarking a multi-turn interaction is quite hard2. It is particularly hard to determine whether the textual response from the agent is appropriate and helpful. Therefore, we will focus on evaluating the sequence of actions that the agent performs. In each response, the agent may perform one of three actions: booking a flight, searching for flights, or no action.3 Each benchmark problem will thus have a sequence of prompts and the expected action that each prompt should elicit. We will give each benchmark a fractional score based on how many of the expected actions are taken, and we will abort early if the agent takes an unexpected action.

This is an example benchmark problem:

- prompt: I need to get from LA to jfk on jan 2
  expected_type: find-flights
  expected_result: [79, 80]
- prompt: book the first one
  expected_type: book-flight
  expected_result: 79
- prompt: Thank you!
  expected_type: text

There are four possible scores an agent can get:

  • An agent that immediately fails to find the right two flights should get a score of 0.
  • An agent that finds the first two flights, but fails to book the right one should get a score of 1/3.
  • An agent that finds both flights, books the right one, but crashes when it receives “Thank you!” should get a score of 2/3.
  • An agent that completes all the actions correctly should get a score of 1.

Benchmarking the agent in this way has its limitations. For example, there may be multiple action sequences that lead to the same desired outcome. But, this is a deliberate simplification for this assignment.

All you need to do is write the following benchmarking function:

def eval_agent(client: OpenAI, benchmark_file: str, flights: List[Flight]) -> float:
    """
    Evaluate the agent on the given benchmark YAML file.
    """
    ...

You may implement this however you like, but we recommend sticking to the following recipe.

  1. Write an Agent class to hold the state of the agent (the conversation and program state). Here is a suggested set of instance variables, but you can adjust them as you see fit.

    from typing import List, Optional
    
    class Agent:
    
        # The complete conversation with the LLM, including the system prompt.
        conversation: List[dict]
        # The formatted response from the last tool call.
        text_prefix: Optional[str]
        # The current database of flights. The tools update this database.
        flights: List[Flight]
        client: OpenAI
        # Global variables used in tool calls.
        program_state: dict
    
        ...
    
  2. Implement the tools as methods of the Agent class.

     class Agent:
    
         ...
         def find_flights(self, origin: str, destination: str, date: date) -> List[Flight]:
             ...
            
         def book_flight(self, flight_id: int) -> Optional[int]:
             ...
    
  3. Implement a say method that sends a user’s message to the LLM, updates the agent state, and produces a result. The key is to ensure that the result is structured in a format that is amenable to benchmarking. Specifically, the result should have both text to show the user and more structured data that indicates the actions that the agent performed. (A sketch of one possible say implementation appears after this recipe.)

     import dataclasses
    
    
     @dataclasses.dataclass
     class AgentResponse:
         """
         The superclass for all agent responses.
         """
         text: str
    
     @dataclasses.dataclass
     class FindFlightsResponse(AgentResponse):
         """
         The agent used the `find_flights` tool and found the following flights.
         """
         available_flights: List[int]
    
    
     @dataclasses.dataclass
     class BookFlightResponse(AgentResponse):
         """
         The agent used the `book_flight` tool and booked the following flight.
         """
         booked_flight: Optional[int]
    
    
     @dataclasses.dataclass
     class TextResponse(AgentResponse):
         pass
    
     class Agent:
    
         ...
         def say(self, user_message: str) -> AgentResponse:
             ...
    
  4. You can now implement the eval_agent function as follows (this version also returns the conversation along with the score, which is handy for debugging):

    import yaml


    @dataclasses.dataclass
    class EvaluationResult:
        """
        The score for one benchmark problem, plus the conversation for debugging.
        """
        score: float
        conversation: List[dict]


    def eval_agent(client: OpenAI, benchmark_file: str, flights: List[Flight]) -> EvaluationResult:
        """
        Evaluate the agent on the given benchmark YAML file.
        """
        agent = ... # Initialize the agent
        with open(benchmark_file, "r") as file:
            steps = yaml.safe_load(file)
        for n, step in enumerate(steps):
            try:
                response = agent.say(step["prompt"])
            except Exception:
                # Per the scoring rules above, a crash counts as failing this step.
                return EvaluationResult(n / len(steps), agent.conversation)
            match step["expected_type"]:
                case "text":
                    if not isinstance(response, TextResponse):
                        return EvaluationResult(n / len(steps), agent.conversation)
                case "find-flights":
                    if not isinstance(response, FindFlightsResponse):
                        return EvaluationResult(n / len(steps), agent.conversation)
                    if set(response.available_flights) != set(step["expected_result"]):
                        return EvaluationResult(n / len(steps), agent.conversation)
                case "book-flight":
                    if not isinstance(response, BookFlightResponse):
                        return EvaluationResult(n / len(steps), agent.conversation)
                    if response.booked_flight != step["expected_result"]:
                        return EvaluationResult(n / len(steps), agent.conversation)
        return EvaluationResult(1.0, agent.conversation)
    
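To make step 3 concrete, here is a sketch of one possible say design. It assumes the system prompt tells the model to reply with a single JSON object whenever it wants to use a tool and with plain English otherwise; the prompt wording, the JSON format, and the example values in it (LAX, JFK, flight 79, the date) are assumptions of this sketch, not requirements. You could instead use the Chat Completions tools parameter if your endpoint supports tool calling.

import json

# An illustrative system prompt; you will want to refine this for your own agent.
SYSTEM_PROMPT = """You are Thomas, a travel agent who can search for and book flights.
To search for flights, reply with ONLY a JSON object such as
{"tool": "find_flights", "origin": "LAX", "destination": "JFK", "date": "2023-01-02"}
To book a flight, reply with ONLY a JSON object such as
{"tool": "book_flight", "flight_id": 79}
Otherwise, reply to the user in plain English."""


class Agent:

    ...

    def say(self, user_message: str) -> AgentResponse:
        # self.conversation is assumed to be initialized with
        # [{"role": "system", "content": SYSTEM_PROMPT}] in __init__.
        self.conversation.append({"role": "user", "content": user_message})
        resp = self.client.chat.completions.create(
            model="meta-llama/Meta-Llama-3.1-8B-Instruct",
            messages=self.conversation,
            temperature=0,
        )
        content = resp.choices[0].message.content or ""
        self.conversation.append({"role": "assistant", "content": content})

        # If the reply is not a JSON tool call, treat it as plain text.
        try:
            command = json.loads(content.strip())
        except json.JSONDecodeError:
            return TextResponse(text=content)
        if not isinstance(command, dict):
            return TextResponse(text=content)

        if command.get("tool") == "find_flights":
            on_date = datetime.strptime(command["date"], "%Y-%m-%d").date()
            found = self.find_flights(command["origin"], command["destination"], on_date)
            return FindFlightsResponse(
                text=f"I found {len(found)} matching flights.",
                available_flights=[f.id for f in found],
            )
        elif command.get("tool") == "book_flight":
            booked = self.book_flight(command["flight_id"])
            text = "Your flight is booked." if booked is not None else "Sorry, that flight is not available."
            return BookFlightResponse(text=text, booked_flight=booked)
        else:
            return TextResponse(text=content)

A real agent will also need to handle malformed tool calls and to turn tool results into a friendlier reply (for example, by appending the results to the conversation and asking the model to summarize them), but this shape is enough to drive the benchmark above.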

Develop a Benchmark for Thomas

You may use Generative AI for this part of the assignment. But, note that LLMs tend to produce very “average” results. You will need to edit their responses to make them more interesting, or be creative with how you prompt the model.

You should develop a benchmark with at least 10 problems that test the strengths and limitations of your agent. You should also include both successful and unsuccessful interactions. In each benchmark, include a comment that describes why you’ve included it and how you worked to improve the agent’s performance on it.
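
For example, a benchmark problem might look like the following, with the required comment at the top. The cities, dates, and flight IDs here are hypothetical placeholders, not values from the real dataset.

# Why: checks that the agent asks a clarifying question instead of guessing a date,
# which an early version of my agent got wrong; tightening the system prompt fixed it.
- prompt: are there any flights from BOS to ORD?
  expected_type: text
- prompt: I meant on March 3
  expected_type: find-flights
  expected_result: [12, 15]
- prompt: book the later one
  expected_type: book-flight
  expected_result: 15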

Demo Thomas on Multi-Turn Conversations

You should use Generative AI for this part of the assignment.

Build a GUI for Thomas. It can either run as a desktop application or a web-based application. If you’re going the web-based route, consider using Gradio. You should submit a PDF (e.g., created in Word or whatever) that shows screenshots of your application, along with text describing what you’re trying to show. Be sure to show off multi-turn interactions, and include both successful and unsuccessful interactions.
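
If you go the Gradio route, a minimal starting point might look like this. It assumes a single global Agent instance; gr.ChatInterface passes each new message and the chat history to your callback.

import gradio as gr

agent = ...  # construct your Agent with the OpenAI client and the flights dataset


def respond(message, history):
    # The GUI only needs the text; the structured part of AgentResponse is for benchmarking.
    return agent.say(message).text


gr.ChatInterface(respond, title="Thomas the Travel Agent").launch()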

  1. Thomas Cook was a travel agency that invented the cookie-cutter vacation. 

  2. See 𝜏-bench for a benchmark of multi-turn LLM interactions. 

  3. This is an over-simplification. Your agent may be capable of performing multiple actions in a single response, but we are going to ignore that.