Search Engines:
11-442 / 11-642
 
 

HW5: Diversification Design Guide

Software Architecture
Scaling Document Scores
Monotonic PM-2 Scores
Development Sequence
Metrics
FAQ

 

Software Architecture

Architecturally, there are five main components to this assignment.

Reranker class: Create a new reranker class for diversity reranking (e.g., RerankWithDiversity). This reranker will always be reranker_1.

Intent Scores: Explicit diversification algorithms require each document from the initial ranking to have a score for the query and for each query intent. The ranker produced a score for the query (e.g., michael jordan), but not for the query intents (e.g., michael jordan basketball, michael jordan movies), so the diversification reranker must produce those scores.

When the diversification reranker is initialized, it reads the file of query intents (e.g., michael jordan basketball, michael jordan movies), parses each intent into a list of query terms, and stores them.
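For example, initialization might look like the sketch below. The file format assumed here (one intent per line, such as "157.1: michael jordan basketball") is an illustration, not a specification; parse whatever format the HW5 query intents file actually uses.

  // Sketch: read the query intents file into a map from query id to the
  // term lists of its intents. The assumed line format ("157.1: terms")
  // is hypothetical; adjust it to the real HW5 intents file. Remember to
  // tokenize/stem intent terms the same way your ranker processes queries.
  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class IntentFile {

    public static Map<String, List<String[]>> read(String path)
        throws IOException {
      Map<String, List<String[]>> intents = new HashMap<>();
      try (BufferedReader in = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = in.readLine()) != null) {
          int colon = line.indexOf(':');
          if (colon < 0) continue;                      // skip malformed lines
          String id = line.substring(0, colon).trim();  // e.g., "157.1"
          int dot = id.indexOf('.');
          String qid = (dot < 0) ? id : id.substring(0, dot);
          String[] terms = line.substring(colon + 1).trim().split("\\s+");
          intents.computeIfAbsent(qid, k -> new ArrayList<>()).add(terms);
        }
      }
      return intents;
    }
  }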

During diversification reranking, for each document in the top n of the initial ranking, calculate a score for each query intent.

  for each document d_j:
    fetch the body termVector from the index
    for each query intent q_i:
      calculate a BM25 score for (q_i, d_j)

To keep things simple, the ranker is always BM25, and it uses the initial ranker's k_1 and b parameters. This score calculation is similar to what your LTR system did for feature f5. Remember the HW3 advice to cache frequently-used values (e.g., RSJ weights, average field length, etc.).
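A sketch of the per-intent BM25 calculation is shown below. The names and signature are hypothetical, and the BM25 details (e.g., clipping negative RSJ weights) should match whatever your initial BM25 ranker does, so that your intent scores agree with it.

  // Sketch: BM25 score for one (intent, document) pair. All names here
  // are hypothetical. Match the BM25 details to your HW1/HW2 ranker.
  import java.util.Map;

  public class IntentBm25 {

    // tf: term -> frequency in the document's body field (from termVector)
    // df: term -> document frequency in the body field (cache these)
    // numDocs: number of documents with a body field
    // docLen: this document's body field length
    // avgDocLen: average body field length (cache at startup)
    // k1, b: the initial ranker's BM25 parameters
    public static double score(String[] intentTerms,
                               Map<String, Long> tf, Map<String, Long> df,
                               long numDocs, double docLen, double avgDocLen,
                               double k1, double b) {
      double score = 0.0;
      for (String t : intentTerms) {
        if (!tf.containsKey(t)) continue;        // term not in this document
        double rsj = Math.max(0.0,               // clipped RSJ term weight
            Math.log((numDocs - df.get(t) + 0.5) / (df.get(t) + 0.5)));
        double f = tf.get(t);
        double tfWeight = f / (f + k1 * ((1 - b) + b * (docLen / avgDocLen)));
        score += rsj * tfWeight;
      }
      return score;
    }
  }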

Organize the rankings: Implementation of the diversification algorithms is simpler if the i+1 scores for each document are organized into a table where each row is a document and each column is a query intent (query score in column 0, intent scores in columns 1 ... i). This step is optional, but recommended.
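As a sketch (the argument names are hypothetical), building the table might look like this:

  // Sketch: one row per document, column 0 = query score, columns 1..i =
  // intent scores. queryScores[d] is the initial ranking's score for
  // document d; intentScores[i][d] is document d's score for intent i.
  public class ScoreTable {

    public static double[][] build(double[] queryScores,
                                   double[][] intentScores) {
      int n = queryScores.length;                // top n documents
      int numIntents = intentScores.length;
      double[][] table = new double[n][numIntents + 1];
      for (int d = 0; d < n; d++) {
        table[d][0] = queryScores[d];            // column 0: score for q
        for (int i = 0; i < numIntents; i++)
          table[d][i + 1] = intentScores[i][d];  // columns 1..i: intents
      }
      return table;
    }
  }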

Scaling: The diversification algorithms assume document scores are in the range [0.0, 1.0]. If any score for query q or any of its intents is greater than 1.0, scale the scores to [0.0, 1.0]. See below for details.

Diversification: Implement PM-2 and xQuAD.

 

Scaling Document Scores

PM-2 and xQuAD can be used with any retrieval algorithm. However, both algorithms expect document scores to be probabilities in the range [0.0, 1.0]. These algorithms do not work properly when document scores are outside of this range. BM25 scores can be outside of this range.

Scores from a single ranking can be converted to probabilities by calculating the sum of the document scores in the ranking, and then dividing each score by that sum. However, when diversifying the results for query q and its intents q_i, all of the rankings should be scaled by the same value, so that document scores remain comparable after scaling. Thus, for query q, calculate the sum of scores for each ranking independently, and then use the maximum of those sums to scale all of the rankings.

For example, given the following BM25 ranking for query q and intent scores for q1 and q2:

  Rank            q           q1           q2
     1  7.788927555  14.67540193  11.25618291
     2  7.772853851  14.65528709  11.28977573
     3  7.772853851  14.71282333  11.28977573
     4  7.753368378  14.65376401  11.23132658
   Sum  31.08800364  58.69727635  45.06706095

The maximum sum of scores is 58.69727635. Thus, divide the document scores in all three of these rankings by 58.69727635.

Only do this calculation with the top reranker_1:rerankDepth documents in each ranking. Documents below this depth are ignored by diversification, so their scores do not play a role in scaling. If you use them anyway, your scores (and thus your diversified rankings) will be a little different from the reference system.
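Given the score table described above (one row per document, one column per ranking), the scaling step might look like the sketch below: sum each column over the top rerankDepth rows, then divide by the largest sum.

  // Sketch: scale a score table in place. Column sums use only the top
  // rerankDepth rows, and scaling is skipped when no score exceeds 1.0.
  public class ScaleScores {

    public static void scale(double[][] table, int rerankDepth) {
      int rows = Math.min(rerankDepth, table.length);
      int cols = table[0].length;
      double maxSum = 0.0;
      boolean needsScaling = false;
      for (int c = 0; c < cols; c++) {
        double sum = 0.0;
        for (int r = 0; r < rows; r++) {
          sum += table[r][c];
          if (table[r][c] > 1.0) needsScaling = true;
        }
        maxSum = Math.max(maxSum, sum);
      }
      if (!needsScaling) return;                 // already in [0.0, 1.0]
      for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
          table[r][c] /= maxSum;
    }
  }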

For a more detailed example, see these input rankings and the corresponding scaling values.

The scaling of one query (e.g., query 151) is independent of the scaling of other queries (e.g., query 152).

 

Monotonic PM-2 Scores

 

Usually PM-2 produces rankings that have monotonically decreasing document scores, but not always. For example, PM-2 can produce this ranking:

   :  :       :           :        :          :          :
  157 Q0 clueweb09-en0010-50-26285 19 0.004970394127 reference
  157 Q0 clueweb09-enwp03-34-00477 20 0.005192763939 reference
   :  :       :           :        :          :          :
  

PM-2 creates a document ranking, but it does not guarantee that the ranking is preserved if the list is sorted by score. trec_eval and ndeval sort rankings by score, so the ranking used to generate metrics may not be exactly the ranking that your system produced. This can be a difficult source of error to find, because you and the evaluation software are examining slightly different rankings. Beware.

Your software must produce monotonically decreasing scores in the PM-2 ranking. It does not matter what the scores are, because trec_eval and ndeval ignore the scores (after sorting is complete). One simple solution is something like this:

  if score(rank_i+1) >= score(rank_i)
    score(rank_i+1) = score(rank_i) * 0.999
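As a runnable version of the same idea, with scores[] holding the PM-2 document scores in rank order:

  // Sketch: force strictly decreasing scores after PM-2 finishes.
  public class MonotonicScores {

    public static void fix(double[] scores) {
      for (int r = 1; r < scores.length; r++)
        if (scores[r] >= scores[r - 1])
          scores[r] = scores[r - 1] * 0.999;
    }
  }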

 

Development Sequence

Start by generating the table of query intent scores described above in "Organize the rankings". You can check that the table is correct by running your HW1 BM25 ranker on the query intents and comparing the scores in its .teIn file to the scores in the table. They should be identical for the top n documents.

Next, implement scaling of document scores. Scaling for a particular query and its query intents is required only if any relevance or intent score is greater than 1.0. If the document scores for query q and all of its query intents q_i are ≤ 1.0, the scores do not need to be scaled.

Then implement xQuAD and PM-2.
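This guide does not restate the algorithms, but as orientation, a sketch of xQuAD's greedy loop over the scaled score table is shown below. The uniform intent weight (1.0 / numIntents) is an assumption for illustration; use whatever intent weighting the assignment specifies. PM-2 has the same overall shape (a greedy loop that repeatedly picks the best remaining row of the table), differing in how each candidate document is scored.

  // Sketch of xQuAD's greedy selection over the scaled score table
  // (column 0 = P(d|q), columns 1.. = P(d|q_i)). The uniform intent
  // weight is an illustrative assumption. Returns the selected row
  // indices in rank order.
  public class Xquad {

    public static int[] rerank(double[][] table, double lambda, int depth) {
      int n = table.length;
      int numIntents = table[0].length - 1;
      double[] notCovered = new double[numIntents]; // prod of (1 - P(d'|q_i))
      java.util.Arrays.fill(notCovered, 1.0);
      boolean[] selected = new boolean[n];
      int[] ranking = new int[Math.min(depth, n)];

      for (int rank = 0; rank < ranking.length; rank++) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int d = 0; d < n; d++) {
          if (selected[d]) continue;
          double diversity = 0.0;                // reward uncovered intents
          for (int i = 0; i < numIntents; i++)
            diversity += (1.0 / numIntents) * table[d][i + 1] * notCovered[i];
          double score = (1.0 - lambda) * table[d][0] + lambda * diversity;
          if (score > bestScore) { bestScore = score; best = d; }
        }
        selected[best] = true;                   // move d* into the ranking
        ranking[rank] = best;
        for (int i = 0; i < numIntents; i++)     // update intent coverage
          notCovered[i] *= (1.0 - table[best][i + 1]);
      }
      return ranking;
    }
  }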

 

Metrics

HW5 requires you to compute diversification metrics (P-IA@10, P-IA@20, and αNDCG@20). The trec_eval software and relevance judgments that you used for HW1-HW4 cannot be used to calculate these metrics. Instead, these metrics are calculated with ndeval and diversification relevance judgments.

You can access ndeval through the ndeval service on the HW5 Testing web page (scroll down).

You can also run ndeval on your laptop. You will need the diversification relevance judgments. Do not use the diversification relevance judgments with trec_eval; the two tools use different kinds of judgments (relevance vs. diversification).

Run the software:

  ndeval diversification_qrels teIn

ndeval produces one row of output per query. The final row ("amean") provides averages across the full query set.

 

FAQ

If you have questions not answered here, see the HW5 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on November 15, 2024

Jamie Callan