Search Engines:
11-442 / 11-642 / 11-742
 
 

HW4: Diversification Design Guide

Software Architecture
Development Sequence

 

Software Architecture

Architecturally, there are five main components to this assignment.

Reranker class: Start by creating a new reranker class for diversity reranking (e.g., RerankWithDiversity). Its inputs are a ranking produced by an initial ranker (e.g., BM25, Indri, an .inRank file) and a block of reranker_n parameters.

Intent rankings: Explicit diversification algorithms require each document from the initial ranking to have a score for the query and for each query intent. The ranker knew about each query (e.g., michael jordan), but it did not know about diversification reranking, so it did not provide initial rankings for each query intent (e.g., michael jordan basketball, michael jordan movies). This creates a problem: The reranker needs some initial rankings that it does not have yet.

You could modify the Ranker to produce the intent rankings before calling the diversity reranker, but that would greatly complicate your system. We recommend that you do not modify your Ranker class.

Instead, it is simpler and less work to have the diversity reranker allocate a new Ranker instance and use the new instance to create rankings for the search intents. The new instance will have slightly different parameters from the Ranker instance used for queries. The queryFilePath, inRankFile:Path, and/or outputLength may require adjustment (e.g., queries vs. query intents) based on the contents of the reranker_n parameter block. To keep things simple, your system may make the following assumptions.

If you find yourself writing new BM25 or Indri code, you are doing it wrong. The diversity reranker should be calling your HW2 code to produce rankings for search intents.
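A minimal sketch of how the second Ranker's parameters might be derived from the first. It assumes the parameters are held in a plain dict; the helper name intent_ranker_params, the file names, and the "type" key are hypothetical. Only the parameter names queryFilePath and outputLength come from the text above.

```python
import copy


def intent_ranker_params(query_params, intent_query_file, output_length=None):
    """Return a modified copy of the query ranker's parameters for ranking intents.

    Hypothetical helper: assumes ranker parameters are stored in a dict.
    """
    # Copy first so the settings of the original Ranker are never mutated.
    params = copy.deepcopy(query_params)
    # Point the new Ranker at the query intents instead of the queries.
    params["queryFilePath"] = intent_query_file
    if output_length is not None:
        params["outputLength"] = output_length
    return params


# Illustrative values only; real parameter files will differ.
query_params = {"queryFilePath": "queries.qry", "outputLength": 100, "type": "BM25"}
intent_params = intent_ranker_params(query_params, "intents.qry", output_length=100)
```

The retrieval settings are copied unchanged, so the intent rankings come from the same HW2 ranking code as the query rankings.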

Organize the rankings: Each query now has i+1 rankings: one for the original query and one for each of its i query intents. Implementation of the diversification algorithms is simpler if the i+1 rankings are reorganized into a table where each row is a document and each column is a score (the query score in column 0, the intent scores in columns 1 through i). This step is optional, but recommended.
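The table described above could be built as follows. This is a sketch under stated assumptions: each ranking is a list of (docid, score) pairs, a document missing from an intent ranking gets a score of 0.0, and documents that appear only in an intent ranking are ignored. The function name build_score_table is hypothetical; check the HW4 page for the required handling of missing documents.

```python
def build_score_table(query_ranking, intent_rankings):
    """Map each docid in the initial ranking to [query score, intent 1 score, ...].

    query_ranking: list of (docid, score) for the original query.
    intent_rankings: list of such lists, one per query intent.
    """
    n_cols = 1 + len(intent_rankings)
    # One row per document in the initial ranking; unseen scores default to 0.0.
    table = {docid: [0.0] * n_cols for docid, _ in query_ranking}
    for docid, score in query_ranking:
        table[docid][0] = score
    for col, ranking in enumerate(intent_rankings, start=1):
        for docid, score in ranking:
            if docid in table:  # assumption: skip docs absent from the initial ranking
                table[docid][col] = score
    return table


query_ranking = [("d1", 0.9), ("d2", 0.5)]
intent_rankings = [[("d1", 0.4), ("d3", 0.8)], [("d2", 0.7)]]
table = build_score_table(query_ranking, intent_rankings)
```

Here d3 is dropped because it is not in the initial ranking, and d1 has no score for the second intent, so that cell stays 0.0.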

Normalization: The diversification algorithms assume document scores are in the range [0.0, 1.0]. If any score for query q or any of its intents is greater than 1.0, normalize the scores to [0.0, 1.0]. See the HW4 page for details.
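The check-then-scale logic might look like the sketch below. The scaling rule is an assumption made for illustration: every score is divided by the largest per-column sum, which keeps non-negative scores in [0.0, 1.0]. The HW4 page defines the rule you must actually use. The function name normalize_table is hypothetical, and the table is assumed to map each docid to a list of scores (query score first, then intent scores).

```python
def normalize_table(table):
    """Scale all scores into [0.0, 1.0], but only if some score exceeds 1.0."""
    if not table or all(s <= 1.0 for row in table.values() for s in row):
        return table  # already in range (e.g., Indri scores); nothing to do
    n_cols = len(next(iter(table.values())))
    # Assumed rule: divide every score by the largest column sum.
    col_sums = [sum(row[c] for row in table.values()) for c in range(n_cols)]
    divisor = max(col_sums)
    return {docid: [s / divisor for s in row] for docid, row in table.items()}


bm25_like = {"d1": [4.0, 2.0], "d2": [6.0, 0.0]}
normalized = normalize_table(bm25_like)      # column sums are 10.0 and 2.0

indri_like = {"d1": [0.3, 0.1], "d2": [0.2, 0.4]}
unchanged = normalize_table(indri_like)      # no score above 1.0
```

Using one divisor for the whole table, rather than one per column, preserves the relative magnitudes of the query and intent scores.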

Diversification: Implement PM-2 and xQuAD.
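As an illustration of how the score table feeds a diversification algorithm, here is a sketch of the greedy xQuAD selection loop. It assumes uniform intent weights, a table that maps each docid to [query score, intent scores...] with all scores in [0.0, 1.0], and hypothetical names (xquad, lam, max_results). It is not a reference implementation; follow the HW4 page and lecture notes for the required behavior, including tie-breaking.

```python
def xquad(table, lam, max_results):
    """Greedy xQuAD reranking; returns a list of (docid, score) pairs."""
    if not table:
        return []
    n_intents = len(next(iter(table.values()))) - 1
    weight = 1.0 / n_intents if n_intents else 0.0  # assumption: uniform intent weights
    # coverage[i] = product over selected docs d_j of (1 - P(d_j | intent i))
    coverage = [1.0] * n_intents
    candidates = dict(table)
    ranking = []

    def score(row):
        diversity = sum(weight * row[i + 1] * coverage[i] for i in range(n_intents))
        return (1.0 - lam) * row[0] + lam * diversity

    while candidates and len(ranking) < max_results:
        best = max(candidates, key=lambda d: score(candidates[d]))
        row = candidates.pop(best)
        ranking.append((best, score(row)))
        # Intents already covered by the chosen document count for less next round.
        for i in range(n_intents):
            coverage[i] *= 1.0 - row[i + 1]
    return ranking


table = {"d1": [0.9, 0.9, 0.0], "d2": [0.8, 0.8, 0.0], "d3": [0.5, 0.0, 0.9]}
reranked = xquad(table, lam=0.5, max_results=3)
order = [docid for docid, _ in reranked]
```

In this example d3 moves ahead of d2 even though its query score is lower, because d1 has already covered the first intent and d3 is the only document that covers the second.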

Development Sequence

Start by testing your system with the simplest (and fastest) configuration: Indri rankings (scores in the range [0.0, 1.0]) read from .inRank files, which rules out problems with your own initial rankings. This allows you to debug the diversification algorithms without worrying about errors in other parts of your system.

Next, test your system with rankings produced by your system (i.e., not .inRank files). Any problems are probably caused by how the intent rankings are generated.

Finally, test your system with BM25 scores. Any problems are probably due to score normalization.

 

FAQ

If you have questions not answered here, see the HW4 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on March 20, 2024

Jamie Callan