Reranking with BERT

Search Engines:
11-442 / 11-642 / 11-742

HW5: Reranking with BERT
Due Apr 15, 11:59pm

Assignment Overview

The purpose of this assignment is to gain experience with several ways of using BERT to rerank an initial document ranking. This assignment consists of several parts.

Add a BERT reranker to the QryEval search engine.
Conduct experiments with your search engine.
Write a report about your work.
Upload your work to the course website for grading.

Warning: This assignment has less implementation than other assignments, but the experiments take about five hours to run on a laptop.

Advice: After your software is debugged, consider setting up a script to run your experiments overnight.

1. New Retrieval Capabilities

HW5 extends your reranking architecture with a new BERT-based reranker and longer reranking pipelines. Rerankers should be evaluated by a loop over reranker_i, so that the architecture is not sensitive to the length of the reranking pipeline.

The new BERT reranker must support the following capabilities.

Represent a document as one or more passages.
Use BERT to encode a (query, passage) pair.
Use BERT to calculate a score for the (query, passsage) pair.
Use a score aggregation method to combine scores for multiple passages into a score for the document.

Use the new scores to rerank the top n documents in the ranking.

See the Design Guide for advice about how to implement these capabilities.

1.1. Machine Learning Toolkit

This assignment uses Pytorch and pretrained models from HuggingFace. In a GPU compute environment, the system would calculate scores for multiple documents in a single call. In a CPU compute environment, batching offers little advantage, which simplifies the architecture. Your system will form a (query, document) pair into a sequence, encode it, and classify it to produce a ranking score.

helloBert.py is an example implementation that matches query text to document text. Run helloBert in a directory that contains Idx.py, PyLu.py, a LIB_DIR subdirectory, and an INPUT_DIR subdirectory that contains index-cw09 and ms-marco-MiniLM-12-v2.

2.2. Parameters

Your software must support all of the parameters used in previous homework, as well as the new parameters described below.

reranker_<n>: The value is a dict that configures a reranker. "n" indicates the reranker's position in the ranking pipeline (lower values first). HW5 uses multiple rerankers, i.e., "reranker_1", "reranker_2", "reranker_3".
- rerankAlgorithm: Which algorithm to use for reranking. The value is "ltr" or "bertrr".
- rerankDepth: The maximum number of documents to rerank.
- bertrr:modelPath: A path to the BERT model used for reranking.
- bertrr:psgLen: The maximum size of a passage from the body field.
- bertrr:psgStride: The distance between the start of passage_i and passage_i+1. Applies only in the body field.
- bertrr:psgCnt: The maximum number of passages from the body field to use when calculating a document score.
- bertrr:maxTitleLength: The maximum number of text tokens to use from the title string. The default is 0.
- bertrr:scoreAggregation: How a document score is calculated using its passage scores. Values are "firstp", "avgp", or "maxp".

2.3. Data

This assignment is done with two models trained by the SBERT project. Download them and put it in your INPUT_DIR directory

ms-marco-MiniLM-L-6-v2 (.zip, .tgz): 6 layers, 159MB compressed, 174 MB uncompressed.
ms-marco-MiniLM-L-12-v2 (.zip, .tgz): 12 layers, 235MB compressed, 255 MB uncompressed).

2.4. Output

Your software must write search results to a file in trec_eval input format, as it did for previous homework.

2.5. Testing Your Software

Use the HW5 Testing Page to access the trec_eval and homework testing services.

You may do local testing on your laptop, as you did for HW1-HW3. The HW4 test cases and grading files (combined into a single directory) are available for download (zip, tgz).

Warning: BERT will use multiple CPU cores when they are available. The web servers are shared resources, so the number of CPU cores available to you will vary from 1-24 depending on how many other tests are running. Thus, the run times shown on the HW5 Testing page are unreliable, because they do not know how many CPU cores your test used. As the homework deadline approaches, we may restrict each test to four CPU cores, which will be slower. Run your tests early, if you can. Or, download the tests and run them on your laptop, which will be much faster.

Warning: The web servers are not powerful enough to run everyone's reproducibility tests to completion. The purpose of the reproducibility tests is to help you confirm that you have uploaded all of the right files and have correct file paths. The reproducibility tests will time out after about 30 seconds. That should be enough time for your software to access parameter, query, and other files to confirm that your configuration is correct.

2. Experiments

Conduct experiments and an analysis that investigate the effectiveness of BERT in different situations. Use BM25 to generate an initial ranking of 1,000 documents for each test query. Rerank the top 250 documents in all of your experiments unless specified otherwise. Test your models on the HW3-Exp queries. Use the same training data for LTR experiments that you used in HW3 (HW3-train.qry, HW3-train.qrel).

All students must conduct a set of reproducible experiments. Undergraduate students must write brief reports that document their work. Graduate students must write longer reports that analyze the experimental results and draw conclusions.

2.1. Reranking Depth

The first experiment examines the effects of reranking depth on the accuracy of LTR and BERT rerankers for the HW3-Exp queries. Use the following retrieval methods.

BM25 (your choice of parameters)
Coordinate Ascent
ListNet
BertRr 6-layer model with psgLen=150, psgStride=125, psgCnt=3, scoreAggregation="maxp", and maxTitleLength=16.
BertRr 12-layer model with psgLen=150, psgStride=125, psgCnt=3, scoreAggregation="maxp", and maxTitleLength=16.

Use a BM25 ranking depth of 1000, and reranking depths of 100, 250, and 500.

You may set the parameters for BM25, Coordinate Ascent, and ListNet based on your results from HW1-HW3. Try to use strong configurations.

2.2. Passages and Aggregation Methods

Typically documents are divided into one or more passage(s) to control computational costs. The second experiment examines different approaches to forming passages and combining passage scoress into document scores. Test six approaches to defining passages with three methods of aggregating their scores into document scores. Use the 6-layer model.

Passage lengths, strides, and counts:

50 tokens, 25 tokens, 3 passages
50 tokens, 25 tokens, 6 passages
100 tokens, 75 tokens, 3 passages
100 tokens, 75 tokens, 6 passages
200 tokens, 175 tokens, 3 passages
200 tokens, 175 tokens, 6 passages

Score aggregation:

firstp: Use just the score of the first passage
maxp: Use just the score of the highest scoring passage
avgp: Use just the average of the passage scores

The result is three tables of experimental results.

2.3. Reranking Configurations

Explore six different configurations of ranking pipelines that have three or more stages (rankers or rerankers), subject to the following constraints.

You must use BM25 and RankedBoolean as initial rankers;
Every configuration must use the BERT reranker;
At least one configuration must end with BERT;
At least one configuration must end with LTR; and
At least one configuration must have two BERT stages.

An example configuration is RankedBoolean (1000)→LTR CA (500)→BERT (100).

Set parameters based on your knowledge from HW1-HW3 and the first two experiments. You will need to defend your choices, so give some thought to your configurations. Grading depens on the quality of your hypotheses and how they show your understanding of search engines, not on how well the configuration actually works; some hypotheses fail, which is how we learn.

3. The Report

11-442 students must submit a report that contains a statement of collaboration and originality, and their experimental results. A template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.

11-642 and 11-742 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.

4. Submission Instructions

Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW5-Report.pdf and place it in the same zip file directory that contains your software (e.g., the directory that contains QryEval.java).

Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.

The Homework Services web page provides information about your homework submissions and access to graded homework reports.

5. Grading

The grading requirements and advice are the same as for HW1.

FAQ

If you have questions not answered here, see the HW5 FAQ and the Homework Testing FAQ.

Jamie Callan

HW5: Reranking with BERT Due Apr 15, 11:59pm