Search Engines:
11-442 / 11-642
 
 

HW5: Diversification Design Guide

Software Architecture
Scaling Document Scores
Monotonic PM-2 Scores
Development Sequence
Metrics
FAQ

 

Software Architecture

Architecturally, there are five main components to this assignment.

Reranker class: Create a new reranker class for diversity reranking (e.g., RerankWithDiversity). This reranker will always be reranker_1.

Intent Scores: Explicit diversification algorithms require each document from the initial ranking to have a score for the query and for each query intent. The ranker produced a score for the query (e.g., michael jordan), but not for the query intents (e.g., michael jordan basketball, michael jordan movies), so the diversification reranker must produce those scores.

When the diversification reranker is initialized, it reads the file of query intents (e.g., michael jordan basketball, michael jordan movies), parses each intent into a list of query terms, and stores them.
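For example, initialization might look like the sketch below. The file format assumed here (one intent per line, such as "157.1: michael jordan basketball") is an illustration, not a specification; parse whatever format the HW5 query intents file actually uses.

  // Sketch: read the query intents file into a map from query id to the
  // term lists of its intents. The assumed line format ("157.1: terms")
  // is hypothetical; adjust it to the real HW5 intents file. Remember to
  // tokenize/stem intent terms the same way your ranker processes queries.
  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class IntentFile {

    public static Map<String, List<String[]>> read(String path)
        throws IOException {
      Map<String, List<String[]>> intents = new HashMap<>();
      try (BufferedReader in = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = in.readLine()) != null) {
          int colon = line.indexOf(':');
          if (colon < 0) continue;                      // skip malformed lines
          String id = line.substring(0, colon).trim();  // e.g., "157.1"
          int dot = id.indexOf('.');
          String qid = (dot < 0) ? id : id.substring(0, dot);
          String[] terms = line.substring(colon + 1).trim().split("\\s+");
          intents.computeIfAbsent(qid, k -> new ArrayList<>()).add(terms);
        }
      }
      return intents;
    }
  }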

During diversification reranking, for each document in the top n of the initial ranking, calculate a score for each query intent.

  for each document d_j:
    fetch the body termVector from the index
    for each query intent q_i:
      calculate a BM25 score for (q_i, d_j)

To keep things simple, the ranker is always BM25, and it uses the initial ranker's k_1 and b parameters. This score calculation is similar to what your LTR system did for feature f5. Remember the HW3 advice to cache frequently-used values (e.g., RSJ weights, average field length, etc.).
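A sketch of the per-intent BM25 calculation is shown below. The names and signature are hypothetical, and the BM25 details (e.g., clipping negative RSJ weights) should match whatever your initial BM25 ranker does, so that your intent scores agree with it.

  // Sketch: BM25 score for one (intent, document) pair. All names here
  // are hypothetical. Match the BM25 details to your HW1/HW2 ranker.
  import java.util.Map;

  public class IntentBm25 {

    // tf: term -> frequency in the document's body field (from termVector)
    // df: term -> document frequency in the body field (cache these)
    // numDocs: number of documents with a body field
    // docLen: this document's body field length
    // avgDocLen: average body field length (cache at startup)
    // k1, b: the initial ranker's BM25 parameters
    public static double score(String[] intentTerms,
                               Map<String, Long> tf, Map<String, Long> df,
                               long numDocs, double docLen, double avgDocLen,
                               double k1, double b) {
      double score = 0.0;
      for (String t : intentTerms) {
        if (!tf.containsKey(t)) continue;        // term not in this document
        double rsj = Math.max(0.0,               // clipped RSJ term weight
            Math.log((numDocs - df.get(t) + 0.5) / (df.get(t) + 0.5)));
        double f = tf.get(t);
        double tfWeight = f / (f + k1 * ((1 - b) + b * (docLen / avgDocLen)));
        score += rsj * tfWeight;
      }
      return score;
    }
  }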

Organize the rankings: Implementation of the diversification algorithms is simpler if the i+1 scores for each document are organized into a table where each row is a document and each column is a query intent (query score in column 0, intent scores in columns 1 ... i). This step is optional, but recommended.
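As a sketch (the argument names are hypothetical), building the table might look like this:

  // Sketch: one row per document, column 0 = query score, columns 1..i =
  // intent scores. queryScores[d] is the initial ranking's score for
  // document d; intentScores[i][d] is document d's score for intent i.
  public class ScoreTable {

    public static double[][] build(double[] queryScores,
                                   double[][] intentScores) {
      int n = queryScores.length;                // top n documents
      int numIntents = intentScores.length;
      double[][] table = new double[n][numIntents + 1];
      for (int d = 0; d < n; d++) {
        table[d][0] = queryScores[d];            // column 0: score for q
        for (int i = 0; i < numIntents; i++)
          table[d][i + 1] = intentScores[i][d];  // columns 1..i: intents
      }
      return table;
    }
  }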

Scaling: The diversification algorithms assume document scores are in the range [0.0, 1.0]. If any score for query q or any of its intents is greater than 1.0, scale the scores to [0.0, 1.0]. See below for details.

Diversification: Implement PM-2 and xQuAD.

 

Scaling Document Scores

PM-2 and xQuAD can be used with any retrieval algorithm. However, both algorithms expect document scores to be probabilities in the range [0.0, 1.0]. These algorithms do not work properly when document scores are outside of this range. BM25 scores can be outside of this range.

Scores from a single ranking can be converted to probabilities by calculating the sum of the document scores in the ranking, and then dividing each score by that sum. However, when diversifying the results for query q and its intents q_i, all of the rankings should be scaled by the same value, so that document scores remain comparable after scaling. Thus, for query q, calculate the sum of scores for each ranking independently, and then use the maximum of those sums to scale all of the rankings.

For example, given the following BM25 ranking for query q and intent scores for q1 and q2:

  Rank            q           q1           q2
     1  7.788927555  14.67540193  11.25618291
     2  7.772853851  14.65528709  11.28977573
     3  7.772853851  14.71282333  11.28977573
     4  7.753368378  14.65376401  11.23132658
   Sum  31.08800364  58.69727635  45.06706095

The maximum sum of scores is 58.69727635. Thus, divide the document scores in all three of these rankings by 58.69727635.

Only do this calculation with the top reranker_1:rerankDepth documents in each ranking. Documents below this depth are ignored by diversification, so their scores do not play a role in scaling. If you use them anyway, your scores (and thus your diversified rankings) will be a little different from the reference system.
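Given the score table described above (one row per document, one column per ranking), the scaling step might look like the sketch below: sum each column over the top rerankDepth rows, then divide by the largest sum.

  // Sketch: scale a score table in place. Column sums use only the top
  // rerankDepth rows, and scaling is skipped when no score exceeds 1.0.
  public class ScaleScores {

    public static void scale(double[][] table, int rerankDepth) {
      int rows = Math.min(rerankDepth, table.length);
      int cols = table[0].length;
      double maxSum = 0.0;
      boolean needsScaling = false;
      for (int c = 0; c < cols; c++) {
        double sum = 0.0;
        for (int r = 0; r < rows; r++) {
          sum += table[r][c];
          if (table[r][c] > 1.0) needsScaling = true;
        }
        maxSum = Math.max(maxSum, sum);
      }
      if (!needsScaling) return;                 // already in [0.0, 1.0]
      for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
          table[r][c] /= maxSum;
    }
  }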

For a more detailed example, see these input rankings and the corresponding scaling values.

The scaling of one query (e.g., query 151) is independent of the scaling of other queries (e.g., query 152).

 

Monotonic PM-2 Scores

 

Usually PM-2 produces rankings that have monotonically decreasing document scores, but not always. For example, PM-2 can produce this ranking:

   :  :       :           :        :          :          :
  157 Q0 clueweb09-en0010-50-26285 19 0.004970394127 reference
  157 Q0 clueweb09-enwp03-34-00477 20 0.005192763939 reference
   :  :       :           :        :          :          :
  

PM-2 creates a document ranking, but it does not guarantee that the ranking is preserved if the list is sorted by score. trec_eval and ndeval sort rankings by score, so the ranking used to generate metrics may not be exactly the ranking that your system produced. This can be a difficult source of error to find, because you and the evaluation software are examining slightly different rankings. Beware.

Your software must produce monotonically decreasing scores in the PM-2 ranking. It does not matter what the scores are, because trec_eval and ndeval ignore the scores (after sorting is complete). One simple solution is something like this:

  if score(rank_i+1) >= score(rank_i)
    score(rank_i+1) = score(rank_i) * 0.999
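As a runnable version of the same idea, with scores[] holding the PM-2 document scores in rank order:

  // Sketch: force strictly decreasing scores after PM-2 finishes.
  public class MonotonicScores {

    public static void fix(double[] scores) {
      for (int r = 1; r < scores.length; r++)
        if (scores[r] >= scores[r - 1])
          scores[r] = scores[r - 1] * 0.999;
    }
  }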

 

Development Sequence

Start by generating the table of query intent scores described above in "Organize the rankings". You can check that the table is correct by running your HW1 BM25 ranker on the query intents and comparing the scores in its .teIn file to the scores in the table. They should be identical for the top n documents.

Next, implement scaling of document scores. Scaling for a particular query and its query intents is required only if any relevance or intent score is greater than 1.0. If the document scores for query q and all of its query intents q_i are ≤ 1.0, the scores do not need to be scaled.

Then implement xQuAD and PM-2.
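This guide does not restate the algorithms, but as orientation, a sketch of xQuAD's greedy loop over the scaled score table is shown below. The uniform intent weight (1.0 / numIntents) is an assumption for illustration; use whatever intent weighting the assignment specifies. PM-2 has the same overall shape (a greedy loop that repeatedly picks the best remaining row of the table), differing in how each candidate document is scored.

  // Sketch of xQuAD's greedy selection over the scaled score table
  // (column 0 = P(d|q), columns 1.. = P(d|q_i)). The uniform intent
  // weight is an illustrative assumption. Returns the selected row
  // indices in rank order.
  public class Xquad {

    public static int[] rerank(double[][] table, double lambda, int depth) {
      int n = table.length;
      int numIntents = table[0].length - 1;
      double[] notCovered = new double[numIntents]; // prod of (1 - P(d'|q_i))
      java.util.Arrays.fill(notCovered, 1.0);
      boolean[] selected = new boolean[n];
      int[] ranking = new int[Math.min(depth, n)];

      for (int rank = 0; rank < ranking.length; rank++) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int d = 0; d < n; d++) {
          if (selected[d]) continue;
          double diversity = 0.0;                // reward uncovered intents
          for (int i = 0; i < numIntents; i++)
            diversity += (1.0 / numIntents) * table[d][i + 1] * notCovered[i];
          double score = (1.0 - lambda) * table[d][0] + lambda * diversity;
          if (score > bestScore) { bestScore = score; best = d; }
        }
        selected[best] = true;                   // move d* into the ranking
        ranking[rank] = best;
        for (int i = 0; i < numIntents; i++)     // update intent coverage
          notCovered[i] *= (1.0 - table[best][i + 1]);
      }
      return ranking;
    }
  }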

 

Metrics

HW5 requires you to compute diversification metrics (P-IA@10, P-IA@20, and αNDCG@20). The trec_eval software and relevance judgments that you used for HW1-HW4 cannot be used to calculate these metrics. Instead, these metrics are calculated with ndeval and diversification relevance judgments.

You can access ndeval through the ndeval service on the HW5 Testing web page (scroll down).

You can also run ndeval on your laptop. You will need the diversification relevance judgments. Do not use the diversification relevance judgments with trec_eval; the two tools use different kinds of judgments (relevance vs. diversification).

Run the software:

  ndeval diversification_qrels teIn

ndeval produces one row of output per query. The final row ("amean") provides averages across the full query set.

 

FAQ

If you have questions not answered here, see the HW5 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on November 15, 2024

Jamie Callan