HW2: Query Rewriting with PRF
Ranking Pipeline
Pseudo Relevance Feedback
Efficiency
File Formats
Implementation Order
Testing
Ranking Pipeline
Your HW1 system has a task-oriented ranking pipeline that supports four types of tasks (Ranker, Output, etc.). This assignment adds a new type of task: Rewriter. The Rewriter class has the same high-level design as the other four task types.
- The Rewriter class is initialized with a dict of parameters when it is created (i.e., task = Rewriter(task_parameters)), and
- The Rewriter class has an execute method that accepts a batch object and returns a batch object.
When the Rewriter type is "prf", it uses pseudo relevance feedback to rewrite the query. A later homework may add another type of query rewriter.
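To make that design concrete, here is a minimal sketch of a Rewriter task. It is not the required implementation: the parameter key and the module path for RewriteWithPrf (the class named under Implementation Order below) are assumptions, and the PRF logic itself is left to RewriteWithPrf.

from RewriteWithPrf import RewriteWithPrf   # assumed module name; adjust to your project layout

class Rewriter:
    """Pipeline task that rewrites each query in a batch."""

    def __init__(self, task_parameters):
        # task_parameters is the dict of Rewriter parameters (see Parameters below).
        self._parameters = task_parameters
        # Dispatch on the rewriter type; "prf" is the only type in HW2.
        if task_parameters.get('rewriter') == 'prf':
            self._impl = RewriteWithPrf(task_parameters)
        else:
            raise ValueError('Unsupported rewriter type')

    def execute(self, batch):
        # Accept a batch object and return a batch object in which each
        # query's qstring has been replaced by the rewritten query.
        return self._impl.execute(batch)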
After the query is rewritten, another Ranker is used to produce a new ranking. The ranking pipelines for this assignment have the following structure.
Ranker → Rewriter → Ranker → Output
The Ranker tasks are your HW1 system, which obtains initial rankings from BM25, RankedBoolean, or an .inRank file (see below). Most people will not need to make any changes to their HW1 system. The two Ranker tasks can be configured identically or differently depending on the experiment. Document your choices in your report.
Pseudo Relevance Feedback
Pseudo relevance feedback algorithms expect the incoming batch to include a Ranking field that contains an ordinary document ranking. They return a batch in which each query's qstring field is rewritten with a new query. The Ranking field is unchanged. (A later task may overwrite the Ranking field, but that is unrelated to the Rewriter.)
This assignment implements the Okapi and query likelihood (Indri, RM3) pseudo relevance feedback algorithms discussed in class.
Conceptually, pseudo relevance feedback is relatively simple.
foreach doc_i in initial ranking:
    foreach term_j in doc_i:
        update the score of term_j
use the top n terms to create a learned_query
write the learned_query to the .qryOut file
use the learned_query to create an expanded query
write the expanded query to the batch
Use the TermVector index (described below) to find the terms that are in a particular document.
When the query likelihood PRF algorithm is used with smoothing, the obvious implementation is a little inefficient. Efficiency can be improved with a clever design, but smoothing does not greatly improve PRF accuracy, so your PRF implementation will not use smoothing (i.e., lambda = 0, mu = 0; see Lecture 5).
The TermVector Index
Pseudo relevance feedback algorithms select terms from the vocabulary that occurs in the top-ranked retrieved documents. Thus, your software must have rapid access to the terms in a document. The TermVector index provides access to this information. QryEval's Idx.getTermVector method returns an object that provides access to Lucene's TermVector index.
The TermVector index provides more capabilities than you need for HW2. Really, all you care about is stems, a list of the terms that occur in the document, and stemsLength(), the length of stems.
Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).
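For example, here is a sketch of gathering the terms of one document through the TermVector index. It assumes stems is exposed as a list of strings, as described above, and that docid and field come from your initial ranking and parameters; Idx is QryEval's index interface, imported however your HW1 code does it.

def document_terms(docid, field):
    # Fetch the TermVector for this document field.
    term_vector = Idx.getTermVector(docid, field)
    if term_vector.stemsLength() == 0:
        return []                 # empty TermVector: the field does not exist
    # stems lists the terms that occur in the document.
    return list(term_vector.stems)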
Discard Terms
The index contains terms that cause problems for the query parser, other parts of your system, or the grading software. Your query expansion software must ignore any candidate expansion term that contains a period (e.g., "clipart.com"), a comma (e.g., "2,000"), or a non-ASCII character.
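A small filter along these lines (the function name is illustrative) keeps the check in one place:

def keep_expansion_term(term):
    # Discard candidate terms that contain a period, a comma,
    # or any non-ASCII character.
    return '.' not in term and ',' not in term and term.isascii()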
Parameters
Your software must support all of the parameters used in previous homework, as well as the new Rewriter parameters described below.
- rewriter: The type of query rewriter. In this assignment, the value is "prf".
- prf:algorithm: The value is "okapi" or "rm3".
- prf:numDocs: The maximum number of documents to use for PRF.
- prf:numTerms: The maximum number of terms to include in the learned query.
- prf:expansionField: The document field to use for learning new queries. If this field is not specified, default to the body field.
- prf:expansionQueryFile: Write the learned query to this file in the format described below.
- prf:rm3:origWeight: Use RM3 to combine the original query and the learned query using this weight for the original query.
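For illustration, a hypothetical fragment of a parameter file in the key=value style used in earlier homeworks; the values are made up, and the parameter names listed above (and the HW2 testing page) are authoritative.

rewriter=prf
prf:algorithm=rm3
prf:numDocs=10
prf:numTerms=25
prf:expansionField=body
prf:expansionQueryFile=HW2-train-0.qryOut
prf:rm3:origWeight=0.5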
Efficiency
Your system will be faster if it caches constant values, for example, the number of documents in the index and RSJ values for candidate terms. You do not need to spend massive effort on efficiency, but try to avoid repeatedly looking up the same value from the index. Most index lookups require Python to talk to Java, which is slow.
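A lightweight way to do this is to memoize values that cannot change during a run. Below is a sketch; the Idx calls are placeholders for whatever lookups your system actually uses, and the RSJ formula shown is the usual idf-style weight, so check the lecture notes for the exact variant you need.

import functools
import math

@functools.lru_cache(maxsize=None)
def corpus_size():
    # The document count never changes during a run, so the
    # Python-to-Java lookup happens only once.
    return Idx.getNumDocs()                  # placeholder call

_rsj_cache = {}

def rsj_weight(field, term):
    # Cache the RSJ weight so each (field, term) pair is looked up once.
    key = (field, term)
    if key not in _rsj_cache:
        df = Idx.getDocFreq(field, term)     # placeholder call
        n = corpus_size()
        _rsj_cache[key] = math.log((n - df + 0.5) / (df + 0.5))
    return _rsj_cache[key]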
File Formats
This assignment uses two new file formats: .inRank files and .qryOut files. The new file formats are described below.
.inRank files
Rankings may be computed dynamically, as in HW1, or read from an .inRank file. QryEval already has the ability to read rankings from .inRank files, although it was not used in HW1. No additional work is required.
.inRank files allow us to test your pseudo relevance feedback algorithm independently of your BM25 implementation, just in case your BM25 implementation is not perfect. They also speed up your experiments: you can produce initial rankings once and reuse them many times, instead of recomputing the same initial rankings for each experiment.
The .teIn and .inRank file formats are identical; the .teIn file produced by a BM25 experiment can be the .inRank input for a PRF experiment.
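For reference, a made-up fragment in that shared format, assuming the trec_eval-style columns from HW1 (query id, the literal Q0, external document id, rank, score, run name); the document ids and scores below are invented.

51 Q0 doc-0001 1 14.213 run-1
51 Q0 doc-0042 2 13.870 run-1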
.qryOut files
If the Rewriter has a prf:expansionQueryFile parameter, your software must write a copy of each learned query to the specified file. Each query is written to a single line in the following format.
qid1: learned-query1
qid2: learned-query2
: :
For example:
51: #sum (obama family white tree politics ...)
92: #sum (french lick indiana ...)
: :
The file should contain only the learned queries, one per line.
The expansion terms may be in lower, mixed, or upper case. Unnecessary tab and space characters are ignored. Sort terms from least important (start of sequence) to most important (end of sequence); break ties lexically (e.g., "cat" occurs lexically before "dog").
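To tie those requirements together, here is a sketch of turning candidate term scores into a learned query and a .qryOut line. The function name is illustrative, the #sum wrapper follows the example above, and the term scoring itself is left to your PRF algorithm.

def write_learned_query(qid, term_scores, num_terms, qry_out):
    # term_scores: dict mapping candidate term -> PRF score.
    # Keep the num_terms highest-scoring terms ...
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:num_terms]
    # ... then order them least important first, breaking ties lexically.
    top.sort(key=lambda kv: (kv[1], kv[0]))
    learned_query = '#sum (' + ' '.join(term for term, _ in top) + ')'
    qry_out.write(f'{qid}: {learned_query}\n')
    return learned_query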
Implementation Order
HW2 has several moving parts, but each is fairly simple. We recommend the following development path.
Implement a Rewriter class (Rewriter.py) and a RewriteWithPrf class.
Start with Okapi and the simplest PRF configuration. Focus on test cases that i) use .inRank files, and ii) use just a few documents and terms. Get the .qryOut file working to simplify debugging.
Once some simple test cases work, shift to test cases that use many documents and many terms. These are slower (longer debugging cycles), but they help you diagnose small errors in scoring.
When all of the .inRank + PRF configurations work properly, do the test cases that use your initial ranking. If your HW1 system worked properly, there shouldn't be much work here.
The grade consists of two components: i) a grade for correct expansion terms in your .qryOut file, and ii) a grade for a correct reranking. Focus first on the expansion terms. After they are correct, focus on the reranking grade. The reranking grade is affected by the initial ranking, the learned query, the expanded query, and reranking by the new scores.
Finally, do query likelihood PRF. Follow the same implementation sequence that you used for Okapi. The algorithms are similar, so software development should go quickly. The query likelihood queries may run more slowly than the Okapi queries though, so be sure to leave enough time.
Testing
You have the same options for testing your software that you had for HW1.
The web services are accessed via the HW2 testing web page.
If you choose to do most of your testing locally on your own machine, use the following files: zip or tgz.
FAQ
If you have questions not answered here, see the HW2 FAQ and the Homework Testing FAQ.