Carnegie Mellon University
Search Engines:
11-442 / 11-642
Fall 2025

HW1: Lexical Retrieval Design Guide

Tasks
.param Files
Implementation sequence
FAQ

Overview

Your search engine is designed to be highly configurable, which enables you to construct ranking pipelines of varying length and complexity. Every experiment is configured by a .param file that describes the queries, the document collection, and a sequence of tasks that your software must perform.

QryEval:main begins by reading the .param file, doing some initialization, and creating a batch. A batch is a dict that describes a set of queries and the work that has been done on them so far. Initially, the batch just consists of query ids and query strings.

        { qid: { "qstring": "some query string" }, ... }

Next, QryEval:main enters a loop over the tasks described in the .param file. On each pass through the loop, a task is performed.

There are several main types of task. Think of a task as a class that may have subclasses.

Each task reads the batch as input, performs the task, and possibly updates the batch. For example, a ranker task will add a ranking for each query.

        { qid: { "qstring": "some query string", "ranking": [ (qid, score), ... ] }, ... }

Rerankers change scores and reorder a ranking. Agents use a ranking to produce an answer.

        { qid: { "qstring": "some query string", "ranking": [ (qid, score), ... ], "answer": "answer string" }, ... }

Output tasks write results to a file.

 

.param Files

A .param file is a json file that creates a (hierarchical) Python dict.

The .param file describes the queries, the document collection, and a pipeline of tasks that your software must perform. Thus, it has two distinct sections:

Example

The .param file below provides paths for the query file and the document index. Every .param file has these two key/value pairs. Then it defines a pipeline of two tasks. The first task uses the BM25 retrieval algorithm to produce a ranking for the query. The second task writes the ranking to a file.

    {
        "indexPath": "INPUT_DIR/index-cw09",
        "queryFilePath": "TEST_DIR/HW1-Train-2.qry",
        "task_1:ranker": {
            "type": "BM25",
            "outputLength": "1000",
            "BM25:k_1": "0.75",
            "BM25:b": "0.3"
        },
        "task_2:output": {
            "type": "trec_eval",
            "outputPath": "OUTPUT_DIR/HW1-Train-2.teIn",
            "outputLength": "1000"
        }
    }

Names

Each task has a unique name (because a dict cannot have two keys with the same name). Task names have the following format:

      task_<id>:<task_type>

For example: "task_1:ranker" identifies a ranker task. The id is just a unique integer; it has no other purpose, and it is not necessarily a sequence number.

Each task is configured by a dict of parameters. Parameters that are specific to a particular type (subclass or implementation) of a task have a prefix. Parameters that apply to all types (subclasses or implementations) of a task have no prefix. For example, a ranker might have the following two parameters.

Later assignments will have many parameters. This naming scheme is intended to help you keep track of what they all apply to.

 

Implementation Sequence

HW1 requires you to understand and make several extensions to the QryEval search engine architecture. It can be a little difficult to know where to start. We recommend the following path.

Your Unranked Boolean retrieval model supports the #OR operator. Extend it to cover the #AND operator. There are four main steps.

Unranked Boolean is now finished.

Implement the Ranked Boolean retrieval model. There are four main steps.

Ranked Boolean is now finished.

Implement BM25. There are five main steps, but it is similar to what you did for RankedBoolean.

BM25 is now finished.

Finally, implement the #NEAR operator. There are three main steps.

If you have done it correctly, #NEAR works for all three retrieval models.

 

FAQ

If you have questions not answered here, see the HW1 FAQ.