Search Engines:
11-442 / 11-642
 
CMU logo
 

HW2: Pseudo Relevance Feedback Design Guide

Software Architecture
Efficiency
File Formats
Implementation Order

 

Software Architecture

This assignment uses a ranking pipeline. The first stage of the pipeline is your HW1 system, which obtains initial rankings from BM25, RankedBoolean, or an .inRank file. Most people will not need to make any changes to their HW1 system.

The QryEval search engine supports a reranking architecture that can call one or more rerankers, and an unfinished Reranker class to manage rerankers. Start your HW2 implementation by looking carefully at the Reranker class.

HW2 requires you to develop a pseudo relevance feedback reranker. Later homeworks require you to develop additional rerankers and test longer reranking pipelines. You will have less work if you are careful to separate functionality that is shared by all rerankers (e.g., managing and merging rankings) from functionality that is specific to a particular reranker. We recommend that you use a separate RerankWithPrf class for pseudo relevance feedback. Later assignments will have other RerankWithXxx classes.

Conceptually, pseudo relevance feedback is relatively simple.

  foreach doc_i in initial ranking:
    foreach term_j in doc_i:
      update the score of term_j
  use the top n terms to create a learned_query
  write the learned_query to the .qryOut file
  use the learned_query to retrieve documents
  write the ranking to the .teIn file

Use the TermVector index (described below) to find the terms that are in a particular document.

Discard Terms

The index contains terms that cause problems for the query parser, other parts of your system, or the grading software. Your query expansion software must ignore any candidate expansion term that contains a period (e.g., "clipart.com"), a comma (e.g., "2,000"), or a non-ASCII term.

Retrieval with the Expanded Query

Using the expanded query to retrieve documents requires minimal effort. When the RerankWithPrf class is initialized, it can use its parameters to allocate a new Ranker object configured for the second retrieval. Use its get_rankings method to generate the new rankings.

The TermVector Index

Query expansion algorithms select expansion terms from the vocabulary that occurs in the top-ranked retrieved documents. Thus, your software must have rapid access to the terms in a document. The TermVector index provides access to this information. QryEval's Idx.getTermVector method returns an object that provides access to Lucene's TermVector index.

The TermVector index provides more capabilities than you need for HW2. Really, all you care about is stems, a list of the terms that occur in the document, and stemsLength(), the length of stems.

Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).

 

Efficiency

Your system will be faster if it caches constant values, for example, the number of documents in the index and RSJ values for candidate terms. You do not need to spend massive effort on efficiency, but try to avoid repeatedly looking up the same value from the index. Most index lookups require Python to talk to Java, which is slow.

 

File Formats

This assignment uses two new file formats: .inRank files and .qryOut files. The new file formats are described below.

.inRank files

Rankings may be computed dynamically, as in HW1, or read from an .inRank file. QryEval already has the ability to read rankings from .inRank files, although it was not used in HW1. No additional work is required.

.inRank files allow us to test your pseudo relevance feedback algorithm independently of your BM25 implementation, just in case your BM25 implementation is not perfect. It also improves the speed of your experiments by allowing you to produce initial rankings once and reuse them many times instead of producing the same initial rankings for multiple experiments.

The .teIn and .inRank file formats are identical; the .teIn file produced by a BM25 experiment can be the .inRank input for a PRF experiment.

.qryOut files

If the reranker_1:prf:expansionQueryFile= parameter is included, your software must write a copy of each expanded query to the specified file. Each query is written to a single line in the following format.

    qid1: learned-query1
    qid2: learned-query2
     :      :

For example:

    51: #sum (obama family white tree politics ...)
    92: #sum (french lick indiana ...)
          :              :

The file should contain only the learned query.

The expansion terms may be in lower, mixed, or upper case. Unnecessary tab and space characters are ignored. Sort terms from least important (start of sequence) to most important (end of sequence); break ties lexically (e.g., "cat" occurs lexically before "dog").

 

 

Implementation Order

HW2 has several moving pieces, but each of them is fairly simple. We recommend the following development path.

 

FAQ

If you have questions not answered here, see the HW3 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on September 23, 2024

Jamie Callan