Obscure Questions

Introduction

In this project we will develop a system to answer obscure questions. For our purposes, an obscure question is one that the LLM usually cannot answer correctly without some help. To provide that help, you will implement a retrieval augmented generation (RAG) system that combines both TF-IDF and neural embeddings to retrieve documents that are relevant to the question.

The Dataset of Documents

We have built a subset of English Wikipedia articles that mention the term “Northeastern University”.1 You can load this dataset as follows:

from datasets import load_dataset

dataset = load_dataset("nuprl/engineering-llm-systems", name="wikipedia-northeastern-university", split="test")

You may think that all mentions of Northeastern University have something to do with the university, but that is not the case. The dataset includes articles that merely cite scholarly work from Northeastern students and faculty. E.g., there are several articles that mention “Northeastern University Press”. Nevertheless, this makes it a more interesting dataset to work with.

The Obscure Questions

From these articles, we built a dataset of obscure questions in three steps:

  1. We prompted gpt4o-mini to generate 1-3 multiple choice questions from each article that have something to do with Northeastern University. We stated that the wrong answers should seem plausible.

  2. We then prompted gpt4o-mini to answer each question without seeing the article and excluded the questions that it got right.

  3. We stopped after a few hours, having only gotten halfway through the documents.

The questions that remain are the ones we deem obscure questions. We are working with Llama 3.1 8B Instruct, and it's safe to assume that questions that are obscure for gpt4o-mini are likely obscure for Llama 3.1 8B Instruct as well.

You can load the full obscure questions dataset as follows:

from datasets import load_dataset

dataset = load_dataset("nuprl/engineering-llm-systems", name="obscure_questions", split="test")

We have also constructed a subset of 50 obscure questions:

from datasets import load_dataset

dataset = load_dataset("nuprl/engineering-llm-systems", name="obscure_questions", split="tiny")

You should get at least 40% of the tiny split correct, but we encourage you to try to get more right, or even try the full test split.

Implementation Task

You may use Generative AI for this assignment, but be careful. There are lots of little decisions to make that can significantly affect the results. Make sure you fully understand any LLM-generated code that you use.

  1. You should implement a function with the following signature:

    from typing import List
    
    def answer_query(question: str, choices: List[str], documents: List[str]) -> str:
        """
        Answers a multiple choice question using retrieval augmented generation.
    
        `question` is the text of the question. `choices` is the list of choices
         with leading letters. For example:
    
         ```
         ["A. Choice 1", "B. Choice 2", "C. Choice 3", "D. Choice 4"]
         ```
    
         `documents` is the list of documents to use for retrieval augmented
         generation.
    
         The result should be the just the letter of the correct choice, e.g.,
         `"A"` but not `"A."` and not `"A. Choice 1"`.
         """
         ...
    
  2. You should submit a notebook where you use answer_query to benchmark its accuracy on the tiny subset of obscure questions.
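The benchmarking loop itself can be a few lines. A minimal sketch is below; the field names `prompt`, `choices`, and `correct_answer` are assumptions for illustration, so check the actual dataset schema before using them. The toy data stands in for the tiny split:

```python
from typing import Callable, Dict, List

def benchmark(answer_query: Callable[[str, List[str], List[str]], str],
              questions: List[Dict], documents: List[str]) -> float:
    """Return the fraction of questions answered correctly."""
    correct = 0
    for q in questions:
        # Field names here are hypothetical; adapt to the real schema.
        predicted = answer_query(q["prompt"], q["choices"], documents)
        if predicted == q["correct_answer"]:
            correct += 1
    return correct / len(questions)

# Toy stand-in for the dataset, using the hypothetical field names above.
toy_questions = [
    {"prompt": "What color is the sky?",
     "choices": ["A. Green", "B. Blue"],
     "correct_answer": "B"},
]

def always_b(question, choices, documents):
    return "B"

print(benchmark(always_b, toy_questions, documents=[]))  # 1.0
```

Replacing `toy_questions` with the tiny split and `always_b` with your `answer_query` gives the accuracy number to report.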

Suggested Approach

At a high level, you will implement non-neural retrieval followed by neural re-ranking. We recommend starting with TF-IDF for the retriever followed by a BERT-based re-ranker. These are in the class notes.
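As a rough sketch of the first stage, here is a from-scratch TF-IDF ranker (cosine similarity over TF-IDF vectors). The tokenizer and scoring details are my own simplifications, not necessarily the class-notes version, and the re-ranking stage is omitted:

```python
import math
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    # Simplistic tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_rank(query: str, documents: List[str], k: int = 10) -> List[int]:
    """Return indices of the top-k documents by TF-IDF cosine similarity."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)

    # Document frequency, then inverse document frequency, per term.
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}

    def vectorize(tokens: List[str]) -> dict:
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    qv = vectorize(tokenize(query))
    qnorm = math.sqrt(sum(w * w for w in qv.values())) or 1.0

    scores = []
    for i, tokens in enumerate(doc_tokens):
        dv = vectorize(tokens)
        dot = sum(qv[t] * dv.get(t, 0.0) for t in qv)
        dnorm = math.sqrt(sum(w * w for w in dv.values())) or 1.0
        scores.append((dot / (qnorm * dnorm), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]
```

Scikit-learn's `TfidfVectorizer` is a reasonable alternative to rolling your own, but implementing it yourself makes the caching suggestions below easier to apply.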

However, feel free to get creative. Some things you might consider:

  1. Consider removing stop words from the queries. You can do this with a regular expression or with spaCy.

  2. In class, we cached IDF scores. Consider caching the TF scores as well.

  3. Consider feeding the LLM the top N documents after re-ranking.

  4. You will need to chunk the documents for them to fit in the LLM context. A fixed-size split is a reasonable baseline, but try to chunk along natural boundaries, such as paragraphs.
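On the chunking point, a simple baseline is an overlapping sliding window over words; the window and overlap sizes below are arbitrary starting points, and splitting on paragraph boundaries first would likely preserve coherence better:

```python
from typing import List

def chunk_document(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split a document into overlapping word-window chunks.

    Overlap reduces the chance that the answer-bearing sentence is
    cut in half at a chunk boundary.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk then becomes a separate "document" for retrieval, with a pointer back to its source article if you need it.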

Speech and Language Processing has a chapter on RAG that I recommend reading for ideas on how to do better.

  1. These documents are a subset of this dump of Wikipedia.