Carnegie Mellon University
Search Engines:
11-442 / 11-642
Fall 2025

HW2: Query Rewriting with PRF

Ranking Pipeline
Pseudo Relevance Feedback
Efficiency
File Formats
Implementation Order
Testing
FAQ

 

Ranking Pipeline

Your HW1 system has a task-oriented ranking pipeline that supports four types of tasks (Ranker, Output, etc.). This assignment adds a fifth type of task: Rewriter. The Rewriter class has the same high-level design as the other four tasks.

When the Rewriter type is "prf", it uses pseudo relevance feedback to rewrite the query. A later homework may add another type of query rewriter.

After the query is rewritten, another ranker is used to produce a new ranking. The ranking pipelines for this assignment have the following structure.

     Ranker → Rewriter → Ranker → Output

The Ranker tasks are your HW1 system, which obtains initial rankings from BM25, RankedBoolean, or an .inRank file (see below). Most people will not need to make any changes to their HW1 system. The two Ranker tasks can be configured identically or differently depending on the experiment. Document your choices in your report.
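
For concreteness, the pipeline might be spelled out as in the sketch below. This is purely illustrative: only the task types (Ranker, Rewriter, Output), the "prf" rewriter type, and the prf:expansionQueryFile parameter come from this handout; every other key and value is an assumption, and your configuration format may differ.

    pipeline = [
        {"task": "Ranker",   "retrievalAlgorithm": "BM25"},        # initial ranking
        {"task": "Rewriter", "type": "prf",                        # rewrite the query
         "prf:expansionQueryFile": "HW2-train.qryOut"},
        {"task": "Ranker",   "retrievalAlgorithm": "BM25"},        # rank the expanded query
        {"task": "Output",   "trecEvalOutputPath": "HW2-train.teIn"},
    ]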

Pseudo Relevance Feedback

Pseudo relevance feedback algorithms expect the task dict to include a Ranking field that contains an ordinary document ranking. They return a task dict in which each query's qstring field is rewritten with a new query. The Ranking field is unchanged. (A later task may overwrite the Ranking field, but that is unrelated to the Rewriter.)
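
To make this contract concrete, here is a purely illustrative before-and-after view of one query's task dict. Only the qstring and Ranking field names come from this handout; the query, the docids, and the tuple layout of the ranking entries are assumptions.

    # Before the Rewriter runs:
    task = {
        "qstring": "obama family tree",
        "Ranking": [("clueweb09-en0003-55-31884", 12.7),
                    ("clueweb09-en0011-23-05478", 11.3)],
    }

    # After a "prf" Rewriter runs: qstring is replaced, Ranking is untouched.
    task = {
        "qstring": "#sum (obama family white tree politics ...)",
        "Ranking": [("clueweb09-en0003-55-31884", 12.7),
                    ("clueweb09-en0011-23-05478", 11.3)],
    }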

This assignment implements the Okapi and query likelihood (Indri, RM3) pseudo relevance feedback algorithms discussed in class.
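
As a refresher only (your lecture notes are authoritative), the Robertson/Sparck Jones (RSJ) weight that Okapi-style PRF commonly uses to score a candidate term t is

    w(t) = log [ ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5)) ]

where R is the number of feedback documents, r is the number of feedback documents that contain t, n is the document frequency of t, and N is the number of documents in the index. N and n are exactly the kinds of constant values worth caching (see Efficiency below).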

Conceptually, pseudo relevance feedback is relatively simple.

  foreach doc_i in initial ranking:
    foreach term_j in doc_i:
      update the score of term_j
  use the top n terms to create a learned_query
  write the learned_query to the .qryOut file
  use the learned_query to create an expanded query
  write the expanded query to the query batch

Use the TermVector index (described below) to find the terms that are in a particular document.

When the query likelihood PRF algorithm is used with smoothing, the obvious implementation is a little inefficient. Efficiency can be improved with a clever design, but smoothing does not greatly improve PRF accuracy, so your PRF implementation will not use smoothing (i.e., lambda = 0 and mu = 0; see Lecture 5).
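
To see why dropping smoothing simplifies the implementation, consider the Dirichlet-smoothed term probability

    p(t|d) = (tf(t,d) + mu * p(t|C)) / (length(d) + mu)

With mu = 0 (and lambda = 0 for linear interpolation), this reduces to the maximum likelihood estimate tf(t,d) / length(d), so a term that does not occur in a document contributes nothing to that document's part of the score. Your scoring loop only needs to visit the terms returned by each document's TermVector, never the whole vocabulary.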

The TermVector Index

Pseudo relevance feedback algorithms select expansion terms from the vocabulary that occurs in the top-ranked documents. Thus, your software must have rapid access to the terms in a document. The TermVector index provides this access. QryEval's Idx.getTermVector method returns an object that provides access to Lucene's TermVector index.

The TermVector index provides more capabilities than you need for HW2. All you really need are stems, a list of the terms that occur in the document, and stemsLength(), the length of that list.

Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).

Discard Terms

The index contains terms that cause problems for the query parser, other parts of your system, or the grading software. Your query expansion software must ignore any candidate expansion term that contains a period (e.g., "clipart.com"), a comma (e.g., "2,000"), or a non-ASCII character.
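
Putting the conceptual loop, the TermVector accessors, and the discard rules together, the term-scoring stage might look like the minimal sketch below. Only Idx.getTermVector, stems, and stemsLength() come from this handout; the ranking representation, the guard against empty entries, and the placeholder score update are assumptions. The real update must be the Okapi or query likelihood formula from lecture.

    from collections import defaultdict

    def score_candidate_terms(ranking, field="body", fb_docs=10):
        # `ranking` is assumed to be a list of (internal_docid, score) pairs,
        # highest-scoring first; Idx is the object QryEval provides.
        term_scores = defaultdict(float)
        for docid, _score in ranking[:fb_docs]:
            tv = Idx.getTermVector(docid, field)
            if tv.stemsLength() == 0:          # empty TermVector: field is missing
                continue
            for i in range(tv.stemsLength()):
                stem = tv.stems[i]
                # Discard terms that break the parser or the grading software.
                if not stem or "." in stem or "," in stem or not stem.isascii():
                    continue
                # Placeholder update: replace with the Okapi or query
                # likelihood term score from lecture.
                term_scores[stem] += 1.0
        return term_scores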

Parameters

Your software must support all of the parameters used in previous homework, as well as the new Rewriter parameters described below.

 

Efficiency

Your system will be faster if it caches constant values, for example, the number of documents in the index and RSJ values for candidate terms. You do not need to spend massive effort on efficiency, but try to avoid repeatedly looking up the same value from the index. Most index lookups require Python to talk to Java, which is slow.
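
A minimal caching pattern, assuming a hypothetical Idx.getDocCount accessor (check QryEval for the real name), is shown below; the same pattern works for per-term statistics such as document frequencies or RSJ weights by keying the cache on the term.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def doc_count(field):
        # One slow Python-to-Java call per field; cached afterwards.
        return Idx.getDocCount(field)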

 

File Formats

This assignment uses two new file formats: .inRank files and .qryOut files. The new file formats are described below.

.inRank files

Rankings may be computed dynamically, as in HW1, or read from an .inRank file. QryEval already has the ability to read rankings from .inRank files, although that capability was not used in HW1. No additional work is required.

.inRank files allow us to test your pseudo relevance feedback algorithm independently of your BM25 implementation, just in case your BM25 implementation is not perfect. They also speed up your experiments: you can produce initial rankings once and reuse them many times, instead of recomputing the same initial rankings for each experiment.

The .teIn and .inRank file formats are identical; the .teIn file produced by a BM25 experiment can be the .inRank input for a PRF experiment.
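
Both use the standard six-column trec_eval input format, one retrieved document per line; the docids and run id below are illustrative.

    51 Q0 clueweb09-en0003-55-31884 1 12.700000 run-1
    51 Q0 clueweb09-en0011-23-05478 2 11.300000 run-1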

.qryOut files

If the Rewriter has a prf:expansionQueryFile parameter, your software must write a copy of each learned query to the specified file. Each query is written to a single line in the following format.

    qid1: learned-query1
    qid2: learned-query2
     :      :

For example:

    51: #sum (obama family white tree politics ...)
    92: #sum (french lick indiana ...)
          :              :

The file should contain only the learned queries, one per line.

The expansion terms may be in lower, mixed, or upper case. Unnecessary tab and space characters are ignored. Sort terms from least important (start of sequence) to most important (end of sequence); break ties lexically (e.g., "cat" occurs lexically before "dog").
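
One way to produce this ordering, assuming a term_scores dict that maps terms to scores (as in the earlier sketch); how ties are broken when selecting the top n terms is also an assumption:

    n = 10   # number of expansion terms; take this from your prf parameters
    top = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    top.sort(key=lambda kv: (kv[1], kv[0]))   # least important first; "cat" before "dog"
    learned_query = "#sum (" + " ".join(term for term, _ in top) + ")"

The #sum wrapper matches the examples above; whether your learned query needs term weights depends on the algorithm.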

 

Implementation Order

HW2 has several moving parts, but each is fairly simple. We recommend the following development path.

 

Testing

You have the same options for testing your software that you had for HW1.

The web services are accessed via the HW2 testing web page.

If you choose to do most of your testing locally on your own machine, use the following files: zip or tgz.

 

FAQ

If you have questions not answered here, see the HW2 FAQ and the Homework Testing FAQ.