HW1: Lexical Retrieval Design Guide
Overview
Your search engine is designed to be highly configurable, which enables you to construct ranking pipelines of varying length and complexity. Every experiment is configured by a .param file that describes the queries, the document collection, and a sequence of tasks that your software must perform.
QryEval:main begins by reading the .param file, doing some initialization, and creating a batch. A batch is a dict that describes a set of queries and the work that has been done on them so far. Initially, the batch just consists of query ids and query strings.
{ qid: { "qstring": "some query string" }, ... }
Next, QryEval:main enters a loop over the tasks described in the .param file. On each pass through the loop, a task is performed.
There are several main types of task. Think of a task as a class that may have subclasses.
- ranker: Use a query to rank documents.
- reranker: Use a query and a ranking of documents to create a new ranking of documents.
- rewriter: Use a query and perhaps a ranking of documents to create a new query.
- agent: Use a query and a ranking to accomplish some task.
- output: Write results to a file.
Each task reads the batch as input, performs the task, and possibly updates the batch. For example, a ranker task will add a ranking for each query.
{ qid: { "qstring": "some query string", "ranking": [ (qid, score), ... ] }, ... }
Rerankers change scores and reorder a ranking. Agents use a ranking to produce an answer.
{ qid: { "qstring": "some query string", "ranking": [ (qid, score), ... ], "answer": "answer string" }, ... }
Output tasks write results to a file.
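Putting these pieces together, the control flow of QryEval:main can be pictured with the sketch below. It is only a sketch: load_queries and make_task are hypothetical helper names invented for this illustration, and the provided code organizes these steps differently.

import json

# A rough sketch of QryEval:main's control flow.  load_queries and make_task
# are hypothetical helpers invented for this illustration; they are not names
# from the provided code.
def main(param_file_path):
    with open(param_file_path) as f:
        params = json.load(f)                    # the .param file is JSON

    # The batch starts out as { qid: {"qstring": ...}, ... }.
    # load_queries is assumed to yield (qid, query string) pairs.
    batch = {qid: {"qstring": qstring}
             for qid, qstring in load_queries(params["queryFilePath"])}

    # Every "task_<id>:<task_type>" key adds one step to the pipeline.
    for name, task_params in params.items():
        if not name.startswith("task_"):
            continue                             # e.g., indexPath, queryFilePath
        task = make_task(name, task_params)      # ranker, reranker, rewriter, ...
        batch = task.run(batch)                  # a task may read and update the batch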
.param Files
A .param file is a json file that creates a (hierarchical) Python dict.
The .param file describes the queries, the document collection, and a pipeline of tasks that your software must perform. Thus, it has two distinct sections:
- Several key/value pairs that provide paths to the queries and documents, and
- A sequence ("pipeline") of key/dict pairs that describe the tasks that your software must perform.
Example
The .param file below provides paths for the query file and the document index. Every .param file has these two key/value pairs. Then it defines a pipeline of two tasks. The first task uses the BM25 retrieval algorithm to produce a ranking for the query. The second task writes the ranking to a file.
{
  "indexPath": "INPUT_DIR/index-cw09",
  "queryFilePath": "TEST_DIR/HW1-Train-2.qry",
  "task_1:ranker": {
    "type": "BM25",
    "outputLength": "1000",
    "BM25:k_1": "0.75",
    "BM25:b": "0.3"
  },
  "task_2:output": {
    "type": "trec_eval",
    "outputPath": "OUTPUT_DIR/HW1-Train-2.teIn",
    "outputLength": "1000"
  }
}
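Because a .param file is plain JSON, you can load and inspect it as an ordinary nested dict. The snippet below does this for the example above; the file name is hypothetical.

import json

# Load the example .param file above into a nested Python dict.
with open("HW1-Train-2.param") as f:          # hypothetical file name
    params = json.load(f)

print(params["indexPath"])                    # INPUT_DIR/index-cw09
print(params["task_1:ranker"]["BM25:b"])      # 0.3  (note: all values are strings)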
Names
Each task has a unique name (because a dict cannot have two keys with the same name). Task names have the following format:
task_<id>:<task_type>
For example: "task_1:ranker" identifies a ranker task. The id is just a unique integer; it has no other purpose, and it is not necessarily a sequence number.
Each task is configured by a dict of parameters. Parameters that are specific to a particular type (subclass or implementation) of a task have a prefix. Parameters that apply to all types (subclasses or implementations) of a task have no prefix. For example, a ranker might have the following two parameters.
- outputLength: All rankers produce a ranked list of some maximum length.
- BM25:k_1: The ranker that uses the BM25 retrieval algorithm needs the k_1 parameter (but other rankers do not).
Later assignments will have many parameters. This naming scheme is intended to help you keep track of which task (and which implementation of it) each parameter applies to.
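To make the convention concrete, here is a small illustration using the example pipeline above; the parsing code is a sketch, not part of the provided engine.

# "task_1:ranker" -> id "task_1", type "ranker".
name = "task_1:ranker"
task_id, task_type = name.split(":", 1)

# Parameters for that task, copied from the example .param file above.
task_params = {"type": "BM25", "outputLength": "1000",
               "BM25:k_1": "0.75", "BM25:b": "0.3"}

# No prefix: applies to every ranker.  "BM25:" prefix: only the BM25 ranker uses it.
output_length = int(task_params["outputLength"])
k_1 = float(task_params["BM25:k_1"])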
Implementation Sequence
HW1 requires you to understand the QryEval search engine architecture and to make several extensions to it. It can be difficult to know where to start, so we recommend the following path.
The Unranked Boolean retrieval model that you are given supports the #OR operator. Extend it to also support the #AND operator. There are four main steps.
- Create the QrySopAnd class. Copy the QrySopOr class and change the logic of the matching criterion (docIteratorHasMatch) from "match any argument" to "match all arguments". A sketch of the "match all" loop appears after this list.
- Consider whether you need to adjust how scores are calculated.
- Modify the QryParser to recognize #AND operators in queries.
- Change the default query operator from #OR to #AND.
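Below is a minimal sketch of the "match all arguments" loop mentioned above. It assumes a QrySop base class and iterator methods named docIteratorGetMatch ("what docid am I on?") and docIteratorAdvanceTo ("move to the first docid >= n"); check the provided Qry/QrySop code for the real names and signatures before copying anything.

class QrySopAnd(QrySop):
    # Sketch only: docIteratorGetMatch and docIteratorAdvanceTo are assumed
    # names for "current docid" and "advance to docid >= n"; the provided
    # iterator API may differ.

    def docIteratorHasMatch(self, model):
        while True:
            # Find the largest docid that any argument currently points at.
            max_docid = -1
            for arg in self._args:
                if not arg.docIteratorHasMatch(model):
                    return False              # some argument is exhausted
                max_docid = max(max_docid, arg.docIteratorGetMatch())

            # Align every argument on that docid; if any had to move,
            # loop again and re-check.
            aligned = True
            for arg in self._args:
                if arg.docIteratorGetMatch() < max_docid:
                    arg.docIteratorAdvanceTo(max_docid)
                    aligned = False
            if aligned:
                return True                   # every argument matches max_docid

If you copied QrySopOr, this matching loop is the only part that needs to change for the Unranked Boolean model; scoring is handled in later steps.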
Unranked Boolean is now finished.
Implement the Ranked Boolean retrieval model. There are four main steps.
- Create a RetrievalModelRankedBoolean class. Copy the RetrievalModelUnrankedBoolean class and make changes as necessary.
- Extend the Ranker class to support RetrievalModelRankedBoolean.
- Modify the #SCORE operator to generate scores for the Ranked Boolean retrieval model. #SCORE needs to produce a score for the current matching document. A #SCORE operator always has one argument (._args[0]) that is a QryIopXxx operator. Look at the QryIop class to see what functions all QryIopXxx query operators support; one of them gives you access to tf and term locations. A sketch appears after this list.
- Depending on how you implemented #OR and #AND score calculations, you may need to make adjustments to QrySopOr and QrySopAnd to support the RankedBoolean retrieval model.
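For the #SCORE change, a hedged sketch is below. The accessor name docIteratorGetMatchPosting and the posting's tf attribute are assumptions (look in QryIop for the function that actually exposes tf and term locations), and the tf-based value is just one reasonable Ranked Boolean score.

class QrySopScore(QrySop):
    def getScore(self, model):
        # Sketch only: the accessor name below is assumed; QryIop provides the
        # real function that exposes tf for the current matching document.
        if isinstance(model, RetrievalModelRankedBoolean):
            posting = self._args[0].docIteratorGetMatchPosting()
            return float(posting.tf)
        # Other retrieval models are handled elsewhere / in later steps.
        raise NotImplementedError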
Ranked Boolean is now finished.
Implement BM25. There are five main steps, but the process is similar to what you did for Ranked Boolean.
- Create a RetrievalModelBM25 class. Note that the BM25 retrieval model has parameters (k_1, b). The retrieval model is a convenient place to store them because it is passed through the retrieval pipeline.
- Extend the Ranker class to support RetrievalModelBM25.
- Modify the #SCORE operator to generate scores for the BM25 retrieval model (the per-term formula is sketched after this list).
- Create the QrySopSum and QrySopWsum classes.
- Modify the QryParser to recognize #SUM and #WSUM in queries.
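The per-term score that #SCORE computes under BM25 follows the usual Okapi formula; the self-contained sketch below shows one common variant. Confirm the exact formula the assignment requires (for example, whether idf is floored at zero or a user weight is included) and how N, df, doclen, and avg_doclen are obtained from your index API.

import math

def bm25_term_score(tf, df, N, doclen, avg_doclen, k_1, b):
    # Common Okapi BM25 per-term score: an idf-like weight times a saturated
    # tf weight.  #SUM then adds these per-term scores over the query terms.
    idf = max(0.0, math.log((N - df + 0.5) / (df + 0.5)))
    tf_weight = tf / (tf + k_1 * ((1 - b) + b * (doclen / avg_doclen)))
    return idf * tf_weight

Intuitively, k_1 controls how quickly repeated occurrences of a term stop adding score, and b controls how strongly long documents are penalized.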
BM25 is now finished.
Finally, implement the #NEAR operator. There are three main steps.
- Use the QryIopSyn class to guide your implementation of a QryIopNear class.
- The evaluate method is the heart of each QryIopXxx operator. It uses the arguments (which are always QryIopXxx operators that have inverted lists) to produce a new inverted list. A sketch of the within-document location matching appears below.
- Modify the QryParser to recognize #NEAR operators in queries.
If you have done it correctly, #NEAR works for all three retrieval models.
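Most of the work in a QryIopNear evaluate implementation happens inside a single document: #NEAR/n typically requires the arguments' locations to appear in order, each within n positions of the previous one (check the operator's exact definition in the assignment). The function below is a self-contained sketch of that inner step only; your evaluate method must also walk the arguments' inverted lists document by document, as QryIopSyn does.

def near_match_positions(position_lists, n):
    # Sketch of #NEAR/n matching within one document.  position_lists holds one
    # sorted list of term locations per argument; the returned list contains
    # the location of the last argument for each match found.
    matches = []
    cursors = [0] * len(position_lists)
    while cursors[0] < len(position_lists[0]):
        prev = position_lists[0][cursors[0]]
        ok = True
        for i in range(1, len(position_lists)):
            # Advance argument i past the previous argument's location.
            while (cursors[i] < len(position_lists[i])
                   and position_lists[i][cursors[i]] <= prev):
                cursors[i] += 1
            if cursors[i] == len(position_lists[i]):
                return matches                # argument i is exhausted
            if position_lists[i][cursors[i]] - prev > n:
                ok = False                    # too far apart; try the next start
                break
            prev = position_lists[i][cursors[i]]
        if ok:
            matches.append(prev)
            cursors = [c + 1 for c in cursors]   # consume the matched locations
        else:
            cursors[0] += 1
    return matches

For example, near_match_positions([[1, 5], [2, 9]], 2) returns [2]: locations 1 and 2 are within 2 positions of each other, but 5 and 9 are not.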
FAQ
If you have questions not answered here, see the HW1 FAQ.