DPR and RAG

Search Engines:
11-442 / 11-642

HW5: Dense Retrieval and Retrieval Augmented Generation
Due Apr 14, 11:59pm

Assignment Overview

The purpose of this assignment is to gain experience with dense passage retrieval (DPR) and retrieval augmented generation (RAG). This assignment consists of several parts.

Add a dense vector ranker and retrieval augmented generation to the QryEval search engine.
Conduct experiments with your search engine.
Write a report about your work.
Upload your work to the course website for grading.

1. New Capabilities

HW5 adds two new capabilities to your QryEval search engine:

A new, dense vector ranker, and
Retrieval augmented generation.

The dense vector is a new first-stage ranking option, serving a purpose similar to Ranked Boolean and BM25. Architecturally, it is just another ranker at the start of your ranking pipeline.

Retrieval augmented generation adds a new stage to your search engine architecture. Your system should treat RAG as an agent that consumes the results of the ranking pipeline. Thus, your HW5 system will consist of three stages.

ranker → rerankers (when specified) → agent (RAG)

A production system might support several different types of agent, similar to how your system now supports two rerankers (Ltr and Bert). Yours will have just the RAG agent.

See the Design Guide for more information about how to implement these new capabilities.

1.1. Conda Environment

Depending on your platform, you may need to upgrade your Conda environment or install a new one. Updating is easiest, so try that first. If it fails, try installing the new environment.

Upgrade: pip install SentencePiece
You may upgrade your Python to 3.9, if that helps.
New environment: conda env create -f 11x42-25S-b.yml
This is the environment that the homework testing service will use for HW5.

Windows people, see the FAQ if you get an error message when your system does generation. You may need to make a small adjustment to your conda installation.

1.2. Input

Your system's new capabilities are configured by the new parameters shown below.

qaEvalOutputPath: A path to the file where your software will write its question-answering results.
ranker:

retrievalAlgorithm: This parameter takes a new value, "dense".
dense:indexPath: A path to the FAISS index used for dense retrieval.
dense:modelPath: A path to the model used for encoding text (a query).

agent_1:

agentAlgorithm: The value is always "rag".
rag:modelPath: A path to the LLM used to generate answers.
rag:dense:modelPath: A path to the LLM used for passage selection. If ranker:dense:modelPath is provided the two paths will be identical (i.e., you can use the same model).
rag:psgLen: The maximum number of text tokens in a document passage.
rag:psgStride: The distance between the start of passage_i and passage_i+1. Applies only in the body field.
rag:psgCnt: The maximum number of passages to consider from one document.
rag:maxTitleLength: The maximum number of text tokens to use from the title string. The default is 0.
rag:prompt: A value from 1 to 5 that identifies the pattern used to form LLM prompts. The default is 1.

1.3. Output

The RAG agent writes results to a file in a format understood by squad_eval, the standard evaluation software for the SqUAD dataset. You must write the software that produces this output. We refer to this as a .qaIn file.

The .qaIn file is a simple json format that consists of question ids and answer values. An example is shown below.

{"56ddde6b9a695914005b962a": "norway", "56ddde6b9a695914005b962c": "16th century", ... }

The file contains one key/value pair for each question. Python's json library is a convenient way to write this file.

1.4. Data

This assignment is done with the ClueWeb09 inverted file index that you have used all semester, as well as with the following new data files.

co-condenser-marco-retriever (.zip or .tgz): A co-condenser model for encoding queries and passages as dense vectors. 387 MB compressed, 418 MB uncompressed.
index-cw09-faiss-t32b300-Fp: A FAISS dense vector index for ClueWeb09. Each document was encoded by the co-conderer-marco-retriever model using up to 32 title token and up to the first 300 tokens from the body field. The internal docids match the internal docids in the inverted index, so you can fetch information about the document (e.g., its title and body strings) from the inverted index, as you did in HW4. 1.6 GB.
flan-t5-base (.zip or .tgz): A Flan-T5-Base (250M parameters) instruction-following large language model. 3.6 GB compressed, 3.9 GB uncompressed.
t5-base (.zip or .tgz): A T5-Base (250M parameters) large language model. 2.0 GB compressed, 4.2 GB uncompressed.

1.5. Testing Your Software

Use the HW5 Testing Page to access the trec_eval and homework testing services.

You may do local testing on your laptop, as you did for HW1, HW2, and HW4. The HW5 test cases and grading files (combined into a single directory) are available for download (zip, tgz).

2. Experiments

Conduct experiments and analyses that investigate the effectiveness of vector-based retrieval and retrieval augmented generation in different situations. Test your models on the HW5 questions.

You will have an opportunity to test a default generation prompt and several custom prompts. The default prompt, prompt 1, is defined as follows.

    question: {question}
    context:
    {context}

The context is a passage selected from the top-ranked document.

All students must conduct a set of reproducible experiments. Undergraduate students must write brief reports that document their work. Graduate students must write longer reports that analyze the experimental results and draw conclusions.

2.1. Passage Selection

The first experiment examines the effects of passage selection strategies on a baseline retrieval augmented generation system.

The model and prompt are configured as follows.

agent_1:rag:modelPath: Varies {t5-base, flan-t5-base}
agent_1:rag:prompt: 1

The search engine returns documents, but usually it is impractical to pass an entire document to the generator. Instead, a passage is selected from the document. Investigate passage sizes from 25 to 200 tokens using firstp and bestp passage selection. Set psgCnt = 6 for bestp passage selection.

agent_1:rag:psgCnt: Varies {1, 6}
agent_1:rag:psgLen: Varies {25, 50, 100, 150, 200}
agent_1:rag:psgStride: Varies. Calculate as psgLen - 10.
agent_1:rag:maxTitleLength: 15

Report results in four tables: {firstp, bestp} × {t5-base, flan-t5-base}.

2.2. Prompt Engineering with Two LLMs

The second experiment investigates the effect of the prompt on two LLMs. Compare the default prompt to four custom prompts that you develop. Use two LLMs (flan-t5-base, t5-base) to help you identify trends.

It is not necessary for your custom prompts to improve on the default prompt. Experiments are evaluated on the quality and practicality of the hypotheses that are explored.

Configure the system (psgCnt, psgLen, etc) based upon your conclusions about the first experiment.

The result is two tables of experimental results.

2.3. Ranking Accuracy

The last experiment investigates the effect of ranking accuracy on answer quality produced by flan-t5-base and t5-base. Select one retrieval augmented generation configuration (passage selection, prompt, etc) based on your prior experiments. Invesigate how that configuration performs when the input ranking is varied.

Test your retrieval augmented generation system with rankings produced by the following ranking pipelines. You have the freedom to set parameters however you wish unless indicated otherwise.

BM25 → RAG
dense → RAG
dense → LTR → RAG
dense → BERT 6 layer → RAG
dense → BERT 12 layer → RAG

The result is two tables of experimental results.

Warning: This experiment is the most time-consuming of the three experiments. Be sure to leave yourself enough time to complete this experiment.

3. The Report

11-442 students must submit a report that contains a statement of collaboration and originality, and their experimental results. A template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.

11-642 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.

4. Submission Instructions

Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW5-Report.pdf and place it in the same zip file directory that contains your software (e.g., the directory that contains QryEval.java).

Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.

The Homework Services web page provides information about your homework submissions and access to graded homework reports.

5. Grading

The grading requirements and advice are the same as for HW1.

FAQ

If you have questions not answered here, see the HW5 FAQ and the Homework Testing FAQ.

Jamie Callan

HW5: Dense Retrieval and Retrieval Augmented Generation Due Apr 14, 11:59pm