Contents: Software Architecture, Efficiency, File Formats, Implementation Order
Software Architecture

This assignment uses a ranking pipeline. The first stage of the pipeline is your HW1 system, which obtains initial rankings from BM25, RankedBoolean, or an .inRank file. Most people will not need to make any changes to their HW1 system.
The QryEval search engine supports a reranking architecture that can call one or more rerankers, and provides an unfinished Reranker class to manage them. Start your HW2 implementation by looking carefully at the Reranker class.
HW2 requires you to develop a pseudo relevance feedback reranker. Later homeworks require you to develop additional rerankers and test longer reranking pipelines. You will have less work if you are careful to separate functionality that is shared by all rerankers (e.g., managing and merging rankings) from functionality that is specific to a particular reranker. We recommend that you use a separate RerankWithPrf class for pseudo relevance feedback. Later assignments will have other RerankWithXxx classes.
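One way to organize this separation is sketched below. Reranker and RerankWithPrf are the class names from the assignment, but every method name, signature, and parameter here is an illustrative assumption, not QryEval's actual API; the PRF body is left as a placeholder.

```python
class Reranker:
    """Shared functionality: manage a pipeline of one or more rerankers."""

    def __init__(self, rerankers):
        self.rerankers = rerankers          # e.g., [RerankWithPrf(...), ...]

    def rerank(self, query, ranking):
        # Apply each reranker in sequence; each one consumes the
        # ranking produced by the previous stage.
        for r in self.rerankers:
            ranking = r.rerank(query, ranking)
        return ranking


class RerankWithPrf:
    """Functionality specific to pseudo relevance feedback lives here,
    not in the shared Reranker class."""

    def __init__(self, num_docs, num_terms):
        self.num_docs = num_docs            # docs to mine for expansion terms
        self.num_terms = num_terms          # terms to keep in the learned query

    def rerank(self, query, ranking):
        top_docs = ranking[:self.num_docs]  # mine only the top-ranked documents
        # ... score candidate terms from top_docs, build the learned
        # query, and run the second retrieval here ...
        return ranking                      # placeholder: return input unchanged
```

Later RerankWithXxx classes would then only need to implement their own rerank method.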
Conceptually, pseudo relevance feedback is relatively simple.
    foreach doc_i in initial ranking:
        foreach term_j in doc_i:
            update the score of term_j
    use the top n terms to create a learned_query
    write the learned_query to the .qryOut file
    use the learned_query to retrieve documents
    write the ranking to the .teIn file
Use the TermVector index (described below) to find the terms that are in a particular document.
The index contains terms that cause problems for the query parser, other parts of your system, or the grading software. Your query expansion software must ignore any candidate expansion term that contains a period (e.g., "clipart.com"), a comma (e.g., "2,000"), or any non-ASCII character.
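This filter is small enough to isolate in one helper. The function name below is an illustrative choice, but the three rejection rules come directly from the assignment.

```python
def is_valid_expansion_term(term):
    """Reject candidate expansion terms that break downstream parsers:
    any term containing a period, a comma, or a non-ASCII character."""
    return ('.' not in term and
            ',' not in term and
            term.isascii())
```

Testing it against the assignment's own examples: "clipart.com" and "2,000" are rejected, as is any term with an accented or other non-ASCII character.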
Using the expanded query to retrieve documents requires minimal effort. When the RerankWithPrf class is initialized, it can use its parameters to allocate a new Ranker object configured for the second retrieval. Use its get_rankings method to generate the new rankings.
Query expansion algorithms select expansion terms from the vocabulary that occurs in the top-ranked retrieved documents. Thus, your software must have rapid access to the terms in a document. The TermVector index provides access to this information. QryEval's Idx.getTermVector method returns an object that provides access to Lucene's TermVector index.
The TermVector index provides more capabilities than you need for HW2. Really, all you care about is stems, a list of the terms that occur in the document, and stemsLength(), the length of stems.
Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).
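The term-scoring loop from the pseudocode above can be sketched with just those two members. TermVector below is a toy stand-in (in QryEval you would call Idx.getTermVector(docid, field) instead), and the +1 score update is a placeholder for HW2's real PRF scoring formula.

```python
class TermVector:
    """Toy stand-in for the object Idx.getTermVector returns; only the
    two members the assignment needs (stems, stemsLength) are modeled."""

    def __init__(self, stems):
        self.stems = stems

    def stemsLength(self):
        return len(self.stems)


def update_term_scores(scores, term_vector):
    """Add one document's terms to the running candidate-term scores.
    The +1 update is a placeholder for the real PRF scoring formula."""
    if term_vector.stemsLength() == 0:
        return                              # empty TermVector: field is missing
    for term in term_vector.stems:
        scores[term] = scores.get(term, 0) + 1
```

Note that the empty-TermVector check handles the missing-field case described above (e.g., a document with no inlinks).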
Efficiency

Your system will be faster if it caches constant values, for example the number of documents in the index and RSJ values for candidate terms. You do not need to spend massive effort on efficiency, but try to avoid repeatedly looking up the same value from the index. Most index lookups require Python to talk to Java, which is slow.
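A small cache object is usually enough. The idx methods below (getNumDocs, getDocFreq) are hypothetical stand-ins for whatever index calls your system makes; the point is the pattern of crossing the Python/Java boundary once per value.

```python
class StatsCache:
    """Cache slow index lookups so each value is fetched at most once.
    The idx method names used here are illustrative assumptions."""

    def __init__(self, idx):
        self.idx = idx
        self.num_docs = idx.getNumDocs()    # constant: fetch exactly once
        self._df = {}                       # term -> document frequency

    def df(self, term):
        # Cross the Python/Java boundary only on the first lookup;
        # later lookups are a fast dictionary hit.
        if term not in self._df:
            self._df[term] = self.idx.getDocFreq(term)
        return self._df[term]
```

The same pattern applies to RSJ values: compute each candidate term's value once, then reuse it for every document that contains the term.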
File Formats

This assignment uses two new file formats: .inRank files and .qryOut files. The new file formats are described below.
Rankings may be computed dynamically, as in HW1, or read from an .inRank file. QryEval already has the ability to read rankings from .inRank files, although this capability was not used in HW1. No additional work is required.
.inRank files allow us to test your pseudo relevance feedback algorithm independently of your BM25 implementation, just in case your BM25 implementation is not perfect. They also improve the speed of your experiments by allowing you to produce initial rankings once and reuse them many times, instead of recomputing the same initial rankings for multiple experiments.
The .teIn and .inRank file formats are identical; the .teIn file produced by a BM25 experiment can be the .inRank input for a PRF experiment.
If the reranker_1:prf:expansionQueryFile= parameter is included, your software must write a copy of each expanded query to the specified file. Each query is written to a single line in the following format.
    qid1: learned-query1
    qid2: learned-query2
    :
    :
For example:
    51: #sum (obama family white tree politics ...)
    92: #sum (french lick indiana ...)
    :
    :
The file should contain only the learned queries, one per line.
The expansion terms may be in lower, mixed, or upper case. Unnecessary tab and space characters are ignored. Sort terms from least important (start of sequence) to most important (end of sequence); break ties lexically (e.g., "cat" occurs lexically before "dog").
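The ordering rule can be implemented with two sorts: one to pick the n best terms, and one to put them in least-important-first order with lexical tie-breaking. The function name and the choice of a {term: score} dict are illustrative; the "qid: #sum (...)" line shape follows the example above.

```python
def format_learned_query(qid, term_scores, n):
    """Format one .qryOut line from a {term: score} dict, keeping the n
    highest-scoring terms. Output order follows the spec: least important
    term first, most important last, ties broken lexically."""
    # Select the n best terms (highest score first; ties lexical).
    top = sorted(term_scores, key=lambda t: (-term_scores[t], t))[:n]
    # Re-sort the survivors into ascending-importance output order.
    ordered = sorted(top, key=lambda t: (term_scores[t], t))
    return f"{qid}: #sum ({' '.join(ordered)})"
```

For example, with scores {obama: 3.0, family: 2.0, cat: 1.0, dog: 1.0, zebra: 0.5} and n = 4, the zebra term is dropped and the tied terms cat and dog appear in lexical order, giving "51: #sum (cat dog family obama)".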
Implementation Order

HW2 has several moving pieces, but each of them is fairly simple. We recommend the following development path.
Start with the simplest Prf configuration. Focus on test cases that i) use .inRank files, and ii) use just a few documents and terms. Get the .qryOut file working early to simplify debugging.
Once some simple test cases work, shift to test cases that use many documents and many terms. These are slower (longer debugging cycles), but they help you diagnose small errors in scoring.
When all of the .inRank + PRF configurations work properly, do the test cases that use your initial ranking. If your HW1 system worked properly, there shouldn't be much work here.
The grade consists of two components: i) a grade for correct expansion terms in your .qryOut file, and ii) a grade for a correct reranking. Focus first on the expansion terms. After they are correct, focus on the reranking grade. The reranking grade is affected by the initial ranking, the learned query, the expanded query, and reranking by the new scores.
If you have questions not answered here, see the HW2 FAQ and the Homework Testing FAQ.
Copyright 2024, Carnegie Mellon University.
Updated on September 23, 2024