Khoury Chatbot

You may use Generative AI for this assignment, but be careful. There are lots of little decisions to make that can significantly affect the results. Make sure you fully understand any LLM-generated code that you use and don’t let the model make decisions for you.

The goal of this project is to build a chatbot that can answer questions about the computing programs at Northeastern University. You should be able to ask it questions such as, “How frequently should I meet with my academic advisor?” or “I’m an incoming transfer student to computer science. How many credits can I transfer?”.

There are three parts to the project:

  1. You will build a dataset of question-answer pairs based on questions a student might ask about the computing programs.

  2. You will train a model to answer these questions.

  3. You will design an evaluation scheme to evaluate the model.

Dataset Building

Let’s first build a dataset of question-answer pairs. The dataset should include a mix of questions that are directly answered in the page text or tables and questions that require some reasoning or inference to answer. It should also include a mix of yes/no questions and questions with open-ended responses.

The first step of this project is to build a source dataset of questions and answers. You will need to scrape the information from selected sections of the Academic Catalog.

Source Dataset Scraping

Do this in a notebook named scraping.ipynb. Clearly label each task with a markdown heading, e.g. ## Task n, before the corresponding section.

Task 1. Download the content of the following URLs and their subpages:

  • https://catalog.northeastern.edu/undergraduate/computer-information-science/
  • https://catalog.northeastern.edu/graduate/computer-information-science/

Here, a subpage is defined as a page whose URL starts with the URL of the parent page. For example, one of the many subpages of https://catalog.northeastern.edu/undergraduate/computer-information-science/ is https://catalog.northeastern.edu/undergraduate/computer-information-science/computer-science/.

You should have a list of pages, where each page is represented by a dictionary that has the following keys and values:

  • url: URL of the page
  • content: Content of the page in markdown, as a string. The content you extract is initially HTML, so you will need to convert it to markdown.
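
A minimal crawling sketch, assuming the requests, beautifulsoup4, and markdownify packages; the function and variable names here are illustrative, not required:

```python
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md

ROOTS = [
    "https://catalog.northeastern.edu/undergraduate/computer-information-science/",
    "https://catalog.northeastern.edu/graduate/computer-information-science/",
]

def crawl(root):
    """Breadth-first crawl of root and every page whose URL starts with root."""
    seen, queue, pages = set(), [root], []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # Converting the whole page keeps navigation chrome; you may want to
        # narrow this to the main content container first.
        pages.append({"url": url, "content": md(str(soup))})
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # absolute URL, no #fragment
            if link.startswith(root):
                queue.append(link)
    return pages

pages = [page for root in ROOTS for page in crawl(root)]
```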

Task 2. Demonstrate your work above by displaying the content of Computer Science, Minor in the notebook. Your scraper should successfully convert both the “Overview” and “Requirements” sections to markdown, even though they appear as separate tabs in the HTML.
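
On the catalog site, the tab panes are typically all present in the page HTML as separate containers, so a whole-page conversion should retain both. A quick sanity check might look like the following; the URL suffix for the minor page is an assumption to verify against your scraped list:

```python
# Hypothetical check; confirm the actual URL of "Computer Science, Minor"
# in your scraped pages before relying on this suffix.
minor = next(p for p in pages if p["url"].endswith("/computer-science-minor/"))
assert "Overview" in minor["content"] and "Requirements" in minor["content"]
print(minor["content"])
```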

Now we have a basic source dataset that will serve as the basis for building the question-answer pairs. Before creating those pairs, we need to perform some post-processing on the webpage content; the goal is then to use an LLM to generate question-answer pairs grounded in the content of each page.

Task 3. Perform post-processing on your data. At minimum, you should do the following:

  • Remove blank lines from the content.
  • Address duplicated information in the content by:
    1. Breaking the content of the webpages into sections
    2. Computing the similarity between each pair of sections
    3. Displaying the top 20 most similar sections
    4. Removing sections that are similar to each other and you think are not necessary for the question-answer pairs

You may find the TfidfVectorizer and KMeans from scikit-learn useful for this task; however, feel free to use any other method you prefer.
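One possible approach is to split each page on markdown headings, then rank section pairs by TF-IDF cosine similarity; the splitting rule and any removal threshold are assumptions to tune against your data:

```python
import itertools

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def split_sections(content):
    """Split markdown content on heading lines, dropping blank lines."""
    sections, current = [], []
    for line in content.splitlines():
        if not line.strip():
            continue  # also handles the blank-line removal requirement
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections

sections = [
    {"url": page["url"], "text": text}
    for page in pages
    for text in split_sections(page["content"])
]
texts = [s["text"] for s in sections]

# Pairwise cosine similarity over TF-IDF vectors; print the top 20 pairs.
sim = cosine_similarity(TfidfVectorizer(stop_words="english").fit_transform(texts))
pairs = [(sim[i, j], i, j) for i, j in itertools.combinations(range(len(texts)), 2)]
for score, i, j in sorted(pairs, reverse=True)[:20]:
    print(f"{score:.3f}  {texts[i][:60]!r}  {texts[j][:60]!r}")
```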

Task 4. Save your post-processed data to a file as a dataset or dataframe.
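
For example, with pandas; the file name and format are illustrative, and keeping the source URL with each section makes the grounding step easier later:

```python
import pandas as pd

# Persist the cleaned sections from the previous task as JSON Lines.
pd.DataFrame(sections).to_json("catalog_sections.jsonl", orient="records", lines=True)
```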

Generating Question-Answer Pairs

The next step is to generate a dataset of question-answer pairs that is grounded in the source data. You should generate up to 20 question-answer pairs per page using Llama 3.1 8B Instruct.

Task 5. Write the template you will use to prompt the model. The template should be in markdown format and should prompt the model to generate question-answer pairs based on the content of the webpage. Save your template to template-for-qa.md.
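
One possible shape for the template; the {content} placeholder and the exact instructions are illustrative, not required wording:

```markdown
You are helping build a question-answer dataset about Northeastern's
computing programs. From the catalog page below, generate up to 20
question-answer pairs. Include questions answered directly by the text or
tables as well as questions needing light inference, and mix yes/no
questions with open-ended ones. Format each pair as:

Q: <question>
A: <answer>

## Page content

{content}
```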

Task 6. Write code to generate question-answer pairs using the dataset as grounding and the template from the previous task. Save this code to a file called create_qa_ds.py and save the generated dataset. In addition, save a sample of 10 question-answer pairs to sample-qa.txt.
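
A sketch of what create_qa_ds.py might look like, assuming the file names from the earlier tasks and a local copy of the model; the chat-style pipeline call requires a reasonably recent transformers release:

```python
import json
from pathlib import Path

import pandas as pd
import transformers

sections = pd.read_json("catalog_sections.jsonl", lines=True)
template = Path("template-for-qa.md").read_text()

pipe = transformers.pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

records = []
for url, group in sections.groupby("url"):
    # str.replace avoids str.format choking on braces inside the page content.
    prompt = template.replace("{content}", "\n\n".join(group["text"]))
    out = pipe(
        [{"role": "user", "content": prompt}],
        max_new_tokens=2048,
        do_sample=False,
    )
    # The pipeline returns the full conversation; the last message is the reply.
    records.append({"url": url, "qa_text": out[0]["generated_text"][-1]["content"]})

Path("qa_dataset.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records) + "\n"
)
```

From there you would still parse each qa_text into individual question/answer records (e.g. by splitting on the Q:/A: markers) and write 10 of them to sample-qa.txt.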

Model Fine-Tuning

See the instructions on how to fine-tune models on NCSA Delta here.

Now that we have a dataset of question-answer pairs, we will train a model to answer these questions. We will use a smaller model, meta-llama/Llama-3.2-1B, for this task. Note that this is not an instruct-tuned model.

Task 7. Fine-tune the model to answer the questions in the dataset you created in Task 6, in a script called finetuning.py. The suggestions on fine-tuning given in class will be valuable for this task.
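
A minimal finetuning.py sketch using TRL's SFTTrainer, assuming the Task 6 output was parsed into one JSON record per pair with "question" and "answer" fields (field and file names illustrative); the hyperparameters are starting points, not prescriptions:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

ds = load_dataset("json", data_files="qa_pairs.jsonl", split="train")

def to_text(example):
    # Llama-3.2-1B is a base model, so train on a plain prompt format
    # rather than a chat template.
    return {"text": f"Question: {example['question']}\nAnswer: {example['answer']}"}

trainer = SFTTrainer(
    model="meta-llama/Llama-3.2-1B",
    train_dataset=ds.map(to_text),
    args=SFTConfig(
        output_dir="llama-3.2-1b-khoury",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
)
trainer.train()
trainer.save_model()  # writes the final checkpoint to output_dir
```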

Task 8. In the notebook eval.ipynb, demonstrate your fine-tuned model by generating answers to 10 questions. You should include:

  • 5 questions derived from the dataset created in Task 6, but with the wording rephrased.
  • 5 additional questions about your program.
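
An illustrative eval.ipynb cell; the checkpoint path matches the output_dir from the fine-tuning sketch, and the questions are placeholders for yours:

```python
import transformers

pipe = transformers.pipeline(
    "text-generation", model="llama-3.2-1b-khoury", device_map="auto"
)

questions = [
    "Roughly how many credits can an incoming CS transfer student bring in?",
    "Does the computer science minor have a GPA requirement?",
    # ...your remaining eight questions
]
for q in questions:
    out = pipe(f"Question: {q}\nAnswer:", max_new_tokens=200, do_sample=False)
    print(out[0]["generated_text"], "\n")
```

Prompting with the same "Question: … Answer:" format used during fine-tuning matters here; a base model has no chat template to fall back on.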

Task 9. Answer the following questions in the notebook, as Markdown cells:

  • What do you observe about the tone, style, and accuracy of the answers generated by the model?
  • What can you conclude about fine-tuning a model on a dataset of domain-specific question-answer pairs?