The purpose of this assignment is to gain experience with several ways of using BERT to rerank an initial document ranking. This assignment consists of several parts.
Warning: This assignment has less implementation than other assignments, but the experiments take about five hours to run on a laptop.
Advice: After your software is debugged, consider setting up a script to run your experiments overnight.
HW4 extends your reranking architecture with a new BERT-based reranker and longer reranking pipelines. Rerankers should be applied in a loop over reranker_i, so that the architecture is not sensitive to the length of the reranking pipeline.
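A minimal sketch of such a loop is shown below. The Reranker interface, the rerank method, and the representation of a ranking as a list of (doc_id, score) pairs are illustrative assumptions; your own architecture will differ in its details.

```python
# Sketch of a reranking pipeline driven by a loop, so that adding or removing
# rerankers does not change the control flow. All names here are illustrative.

def run_pipeline(initial_ranking, rerankers):
    """initial_ranking: list of (doc_id, score) pairs, best first.
    rerankers: list of (reranker, depth) pairs applied in order."""
    ranking = initial_ranking
    for reranker, depth in rerankers:
        top, rest = ranking[:depth], ranking[depth:]
        rescored = reranker.rerank(top)              # new (doc_id, score) pairs
        rescored.sort(key=lambda pair: pair[1], reverse=True)
        ranking = rescored + rest                    # untouched tail keeps its order
    return ranking
```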
The new BERT reranker must support the following capabilities.
Use the new scores to rerank the top n documents in the ranking.
See the Design Guide for advice about how to implement these capabilities.
This assignment uses Pytorch and pretrained models from HuggingFace. In a GPU compute environment, the system would calculate scores for multiple documents in a single call. In a CPU compute environment, batching offers little advantage, which simplifies the architecture. Your system will form a (query, document) pair into a token sequence, encode it, and classify it to produce a ranking score.
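For concreteness, the sketch below scores a single (query, document) pair with a HuggingFace cross-encoder on the CPU. The model path is an assumption (a local copy of the MiniLM cross-encoder in INPUT_DIR); adapt it to your own file layout and to however your system accesses document text.

```python
# Sketch: score one (query, document) pair with a cross-encoder on the CPU.
# The model directory below is an assumed local path; adjust as needed.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = "INPUT_DIR/ms-marco-MiniLM-12-v2"        # assumed local model copy
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def bert_score(query_text, doc_text):
    # Encode the (query, document) pair as a single token sequence.
    inputs = tokenizer(query_text, doc_text, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        # The classifier's logit is used directly as the ranking score.
        return model(**inputs).logits.squeeze().item()
```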
helloBert.py is an example implementation that matches query text to document text. It is fairly easy to understand, and it confirms that your Anaconda environment is configured properly for HW4. Run helloBert in a directory that contains Idx.py, PyLu.py, a LIB_DIR subdirectory, and an INPUT_DIR subdirectory that contains index-cw09 and ms-marco-MiniLM-12-v2.
Your software must support all of the parameters used in previous homework, as well as the new parameters described below.
This assignment uses two models trained by the SBERT project. Download them and put them in your INPUT_DIR directory.
Your software must write search results to a file in trec_eval input format, as it did for previous homework.
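A trec_eval input file has one line per retrieved document: query id, the literal Q0, an external document id, rank, score, and a run id. A minimal writer is sketched below; the function and variable names are illustrative only.

```python
# Sketch of writing one query's ranking in trec_eval input format:
#   query_id Q0 external_doc_id rank score run_id
# Names are illustrative; match your own code's conventions.

def write_trec_eval(f, query_id, ranking, run_id="run-1"):
    """f: an open output file. ranking: (external_doc_id, score) pairs, best first."""
    for rank, (doc_id, score) in enumerate(ranking, start=1):
        f.write(f"{query_id} Q0 {doc_id} {rank} {score:.12f} {run_id}\n")
```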
Use the HW4 Testing Page to access the trec_eval and homework testing services.
You may do local testing on your laptop, as you did for HW1-HW3. The HW4 test cases and grading files (combined into a single directory) are available for download (zip, tgz).
Warning: BERT will use multiple CPU cores when they are available. The web servers are shared resources, so the number of CPU cores available to you will vary from 1 to 24, depending on how many other tests are running. Thus, the run times shown on the HW4 Testing page are unreliable, because they do not reflect how many CPU cores your test used. As the homework deadline approaches, we may restrict each test to four CPU cores, which will be slower. Run your tests early if you can, or download the tests and run them on your laptop, which will be much faster.
Warning: The web servers are not powerful enough to run everyone's reproducibility tests to completion. The purpose of the reproducibility tests is to help you confirm that you have uploaded all of the right files and have correct file paths. The reproducibility tests will time out after about 30 seconds. That should be enough time for your software to access parameter, query, and other files to confirm that your configuration is correct.
Conduct experiments and an analysis that investigate the effectiveness of BERT in different situations. Use BM25 to generate an initial ranking of 1,000 documents for each test query. Rerank the top 250 documents in all of your experiments unless specified otherwise. Test your models on the HW1 bag-of-words queries. Use the same training data for LTR experiments that you used in HW3 (HW3-train.qry, HW3-train.qrel).
All students must conduct a set of reproducible experiments. Undergraduate students must write brief reports that document their work. Graduate students must write longer reports that analyze the experimental results and draw conclusions.
The first experiment examines the effects of reranking depth on the accuracy of LTR and BERT rerankers for the HW1 bag-of-words queries. Use the following retrieval methods.
Use a BM25 ranking depth of 1000, and reranking depths of 100, 250, and 500.
You may set the parameters for BM25, Coordinate Ascent, and ListNet based on your results from HW1-HW3. Try to use strong configurations.
Typically, documents are divided into one or more passages to control computational costs. The second experiment examines different approaches to forming passages and combining passage scores into document scores. Test six approaches to defining passages with three methods of aggregating their scores into document scores.
The ranking pipeline is BM25 → BERT. Use a BM25 ranking depth of 1000, and a reranking depth of 250. Use the 6-layer BERT model.
Passage lengths, strides, and counts:
Score aggregation:
This experiment produces three tables of results.
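One possible way to organize the passage handling is sketched below. Passage length and stride are measured in tokens, and max, avg, and sum appear only as placeholder aggregation methods; use the passage definitions and aggregation methods that the assignment specifies.

```python
# Sketch of splitting a document into overlapping passages and aggregating
# passage scores into a document score. Names and methods are illustrative.

def make_passages(tokens, length, stride, max_passages=None):
    """Return a list of token lists, each at most `length` tokens long."""
    passages, start = [], 0
    while start < len(tokens):
        passages.append(tokens[start:start + length])
        if max_passages is not None and len(passages) == max_passages:
            break
        if start + length >= len(tokens):
            break
        start += stride
    return passages

def aggregate(passage_scores, method="max"):
    """Combine passage scores into a single document score."""
    if method == "max":
        return max(passage_scores)
    if method == "avg":
        return sum(passage_scores) / len(passage_scores)
    if method == "sum":
        return sum(passage_scores)
    raise ValueError(f"unknown aggregation method: {method}")
```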
Explore six different configurations of ranking pipelines that have three or more stages (rankers or rerankers), subject to the following constraints.
An example configuration is RankedBoolean (1000)→LTR CA (500)→BERT (100).
Set parameters based on your knowledge from HW1-HW3 and the first two experiments. You will need to defend your choices, so give some thought to your configurations. Grading depends on the quality of your hypotheses and how they show your understanding of search engines, not on how well the configuration actually works; some hypotheses fail, which is how we learn.
When creating ranking pipelines, it may help you to think about i) your Recall and Precision goals for rerankers at different locations in the pipeline, and ii) your goal for the pipeline as a whole. You may set your own goals, for example, maximizing accuracy without worrying about efficiency, or finding a balance of accuracy and efficiency, or something else. You do not have enough computational power to do a parameter sweep, so let your goals guide you.
11-442 students must submit a report that contains a statement of collaboration and originality, and their experimental results. A template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.
11-642 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.
Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW4-Report.pdf and place it in the same zip file directory that contains your software (e.g., the directory that contains QryEval.py).
Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.
The Homework Services web page provides information about your homework submissions and access to graded homework reports.
The grading requirements and advice are the same as for HW1.
If you have questions not answered here, see the HW4 FAQ and the Homework Testing FAQ.
Copyright 2024, Carnegie Mellon University.
Updated on November 11, 2024