Carnegie Mellon University
Search Engines:
11-442 / 11-642
Spring 2026

HW1: Lexical Retrieval Design Guide

Development Environment
Architecture Overview
Parameter Files
New Capabilities
Implementation Sequence
Testing
Important Advice
FAQ

QryEval documentation

 

Development Environment

The homework assignments are done in Python using an Anaconda development environment with a standard set of libraries and tools. If you do not already have Anaconda on your computer, download and install it.

Next, define an Anaconda environment that contains the same libraries and tools that will be used to test and grade your software.

conda env create -f 11x42-26S-a.yml

Initialize the software development directory. This can be done in one of two ways.

  1. The simplest method is to download init_dir.zip or init_dir.tgz (4.4G compressed, 7.0G uncompressed); unpacking it creates and populates the following directory structure.

  2. Or, you can create it manually, using the files in the Source column below.

    Directory       Contents                          Source
    QryEval/        QryEval software                  .zip or .tgz
      INPUT_DIR/    Data files, other software        Document index (.zip or .tgz);
                                                      relevance judgments (.qrel)
      LIB_DIR/      Java jar files                    .zip or .tgz
      OUTPUT_DIR/   Output from your software         Initially empty
      TEST_DIR/     Public ("training") test cases    .zip or .tgz

    The QryEval software, the homework parameter files, and the homework testing system expect this directory structure. When your software is tested, the current working directory will be the QryEval directory.

Finally, download the trec_eval executable and put it in the QryEval/INPUT_DIR directory. There are different versions for Mac (see the FAQ), Linux, and Windows.

At this point, you should be able to run one of the HW1 public ("training") cases.

     conda activate 11x42-26S-a
     python QryEval.py TEST_DIR/HW1-Train-0.param

The output should be the following.

     -- Ranker: UnrankedBoolean --
     22: #or( rick warren )
     ==> #OR(#SCORE(rick.body ) #SCORE(warren.body ) )
     27: #or( starbucks )
     ==> #SCORE(starbuck.body )
     Time: 4.2 secs

There should also be a file HW1-Train-0.teIn in the OUTPUT_DIR directory.

 

Architecture Overview

This assignment extends QryEval, a search engine written on top of Lucene, a widely-used open-source library that is the foundation of ElasticSearch, Solr, Pyserini, and other search engines and toolkits. Lucene is written in Java, so your search engine is a combination of Python and Java software; however, you will interact only with the Python code.

QryEval provides Python classes that implement a document-at-a-time architecture and access to Lucene indexes. See the QryEval documentation for details about classes and methods.

QryEval is designed to be highly configurable, which enables you to construct ranking pipelines of varying length and complexity. Every experiment is configured by a .param file that describes the queries, the document collection, and a sequence of tasks that your software must perform.

QryEval:main begins by reading the .param file, doing some initialization, and creating a batch. A batch is a dict that describes a set of queries and the work that has been done on them so far. Initially, the batch just consists of query ids and query strings.

        { "47": { "qstring": "Who is John Galt?" }, ... }

Next, QryEval:main enters a loop over the tasks described in the .param file. On each pass through the loop, a task is executed.

There are several main types of task (for example, rankers, rerankers, agents, and output tasks). Think of a task as a class that may have subclasses.

Each task reads the batch as input, does some work, and possibly updates the batch. For example, a ranker task will add a ranking for each query.

        { "47": { "qstring": "Who is John Galt?", "ranking": [ (docid, score), ... ] }, ... }

Rerankers change scores and reorder a ranking. Agents use a ranking to produce some other result, for example, an answer.

        { "47": { "qstring": "Who is John Galt?", "ranking": [ (docid, score), ... ], "answer": "John Galt is ..." }, ... }

Output tasks write results to a file.
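
The sketch below illustrates this control flow on a toy example. It is not QryEval's actual code: the task functions, the query, and the placeholder scores are invented purely to show how each task reads the batch, does some work, and may add to it.

    # A toy illustration (not QryEval's actual code) of the batch/task pattern:
    # each task reads the batch, does some work, and may add to it.

    def ranker_task(batch):
        # A ranker adds a "ranking" (a list of (docid, score) pairs) to each query.
        for qid, entry in batch.items():
            entry["ranking"] = [("doc-a", 2.0), ("doc-b", 1.0)]   # placeholder scores
        return batch

    def output_task(batch):
        # An output task writes results somewhere; here it just prints them.
        for qid, entry in batch.items():
            for docid, score in entry["ranking"]:
                print(qid, docid, score)
        return batch

    # Initially the batch contains only query ids and query strings.
    batch = {"47": {"qstring": "Who is John Galt?"}}

    # QryEval:main loops over the tasks named in the .param file; here the
    # pipeline is simply hard-coded as two functions.
    for task in (ranker_task, output_task):
        batch = task(batch)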

 

Parameter Files

A .param file is a json file that is read into a (hierarchical) Python dict.

The .param file describes the queries, the document collection, and a pipeline of tasks that your software must perform. Thus, it has two distinct sections: a few key/value pairs that identify the queries and the document index, and a set of entries that define the pipeline of tasks.

Example

The .param file below provides paths for the query file and the document index. Every .param file has these two key/value pairs. Then it defines a pipeline of two tasks. The first task uses the BM25 retrieval algorithm to produce a ranking for the query. The second task writes the ranking to a file.

    {
        "indexPath": "INPUT_DIR/index-cw09",
        "queryFilePath": "TEST_DIR/HW1-Train-2.qry",
        "task_1:ranker": {
            "type": "BM25",
            "outputLength": "1000",
            "BM25:k_1": "0.75",
            "BM25:b": "0.3"
        },
        "task_2:output": {
            "type": "trec_eval",
            "outputPath": "OUTPUT_DIR/HW1-Train-2.teIn",
            "outputLength": "1000"
        }
    }

Notice that there are two outputLength parameters. The Ranker outputLength sets the maximum number of documents returned by the ranker. The Output outputLength sets the maximum number of documents written to the trec_eval file. These may seem redundant. The need for different output lengths is more apparent in later homework, when the ranking pipeline becomes longer.

Names

Each task has a unique name (because a dict cannot have two keys with the same name). Task names have the following format:

      task_<id>:<task_type>

For example: "task_1:ranker" identifies a ranker task. The id is just a unique integer; it has no other purpose, and it is not necessarily a sequence number.

Each task is configured by a dict of parameters. Parameters that are specific to a particular type (subclass or implementation) of a task have a prefix. Parameters that apply to all types (subclasses or implementations) of a task have no prefix. For example, a ranker might have the following two parameters.
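
      "outputLength": "1000"      (no prefix; applies to every type of ranker)
      "BM25:k_1": "0.75"          ("BM25:" prefix; applies only to the BM25 ranker)

Both of these appear in the example parameter file above.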

Later assignments will have many parameters. This naming scheme is intended to help you keep track of which task, and which type of task, each parameter applies to.

HW1 Parameters

Your software must support the following parameters.

See the HW1 Testing page for example parameter files.

 

New Capabilities

Your search engine must support the following capabilities. In the descriptions below, you must implement everything that is not already provided in QryEval.

  1. Retrieval models:

    1. Unranked Boolean retrieval: Every document that matches the query gets a score of 1. (Provided in QryEval.)
    2. Ranked Boolean retrieval: The score for matching a query term is its term frequency (tf) in the document.
    3. BM25 retrieval: The score for matching a query term is a tf.idf score.
  2. Query operators:

    1. OR: return a document if at least one of the query arguments occurs in the document. Use the MAX function to combine the scores from the query arguments. (Provided in QryEval.) Used by the Unranked Boolean and Ranked Boolean retrieval models.
    2. AND: return a document if all of the query arguments occur in the document. Use the MIN function to combine the scores from the query arguments. Used by the Unranked Boolean and Ranked Boolean retrieval models.
    3. NEAR/n: return a document if all of the query arguments occur in the document, in order, with no more than n-1 terms separating two adjacent terms. For example, #NEAR/2(a b c) matches "a b c", "a x b c", "a b x c", and "a x b x c", but not "a x x b c". The tf for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean, BM25). Used by all retrieval models.
    4. WINDOW/n: return a document if all of the query arguments occur in the document in any order within a window of n terms. For example, #WINDOW/5(a b c) matches "c b a", "b x a c", "a b x c", and "b x c x a", but not "a x x b b" or "a x b x x c". The tf for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean, BM25). Used by all retrieval models.
    5. SYN: return a document if it matches any of the query arguments. The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean). (Provided in QryEval.) Used by all retrieval models.
    6. SUM: return a document if at least one of the query arguments occurs in the document. Use summation to combine the scores from the query arguments. Used by BM25.
    7. WSUM: return a document if at least one of the query arguments occurs in the document. Use weighted summation to combine the scores from the query arguments. Used by BM25.

  3. Default query operator: Unstructured queries (e.g., "apple pie recipes") must use a default query operator. #AND is the default operator for Unranked and Ranked Boolean retrieval models. #SUM is the default operator for the BM25 retrieval model.

    After you implement the #AND operator, you will need to change the default operator for the UnrankedBoolean retrieval model from #OR to #AND.

  4. Document fields: Your software must support the five fields provided by the index. (Provided in QryEval.)

    1. url: Terms extracted from the document url.
    2. keywords: Terms that the author provided in the <meta name="Keywords" content="..."> HTML tag, if any.
    3. inlink: Terms that other documents use in hyperlinks to this document, if any.
    4. title: Terms that the author provided in the <title> HTML element, if any.
    5. body: Terms extracted from the document body. This is the main document content.

    The field name (if any) is specified in the query using a suffix-based syntax of the form 'term.field', as in 'apple.title'. If the query does not specify a field (e.g., if the user types 'apple'), your software should default to using the 'body' field.

  5. trec_eval output: QryEval writes rankings in a format that enables trec_eval to generate standard metrics. You will need to make some improvements.

    Search results must be sorted by their scores (primary key, descending order) and their external document ids (secondary key, alphabetic order; used for tie scores). Do the sort before you pass results to the TeIn object.

    When a query retrieves no documents, write a line with a non-existent document id (e.g., 'dummy' or 'Nonexistent_Docid') and a score of 0.

    The last field of each line is a run identifier. It may be anything. You may find this convenient for labeling different experiments.
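
    A minimal sketch of these two output requirements, assuming (as in the Architecture Overview) that a ranking is a list of (external docid, score) pairs. The helper below is hypothetical; the TeIn class itself is part of QryEval, and this only shows the ordering and the empty-result placeholder.

        # Hypothetical helper: sort a ranking as required, and insert a
        # placeholder line when the query retrieved no documents.

        def prepare_for_output(ranking):
            # Primary key: score, descending.  Secondary key: external document
            # id, ascending alphabetic order (breaks ties deterministically).
            ranking = sorted(ranking, key=lambda pair: (-pair[1], pair[0]))

            # If the query retrieved nothing, emit one placeholder line with a
            # non-existent docid and a score of 0 so trec_eval still sees the query.
            if not ranking:
                ranking = [("dummy", 0.0)]
            return ranking

        print(prepare_for_output([("doc-b", 2.5), ("doc-a", 2.5), ("doc-c", 7.0)]))
        # -> [('doc-c', 7.0), ('doc-a', 2.5), ('doc-b', 2.5)]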

 

Implementation Sequence

HW1 requires you to understand and make several extensions to the QryEval search engine architecture. It can be a little difficult to know where to start. We recommend the following path.

Your Unranked Boolean retrieval model supports the #OR operator. Extend it to cover the #AND operator. There are four main steps.

Unranked Boolean is now finished.

Implement the Ranked Boolean retrieval model. There are four main steps.

Ranked Boolean is now finished.

Implement BM25. There are five main steps, but the work is similar to what you did for Ranked Boolean.

BM25 is now finished.

Implement the #NEAR operator. There are three main steps.

If you have done it correctly, #NEAR works for all three retrieval models.

Finally, implement the #WINDOW operator. There are three main steps.

If you have done it correctly, #WINDOW works for all three retrieval models.
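
The trickiest part of #NEAR and #WINDOW is the positional matching inside a single document. The sketch below shows one possible way, not necessarily how the provided or reference code is structured, to compute the #NEAR/n tf from plain Python position lists, following the definition in the New Capabilities section; each matched position is consumed by at most one match. #WINDOW/n needs a similar loop, except that the arguments may appear in any order within a window of n terms.

    # One possible way to compute the tf of #NEAR/n within a single document,
    # given a sorted list of positions for each query argument.  Each matched
    # position is used by at most one match.

    def near_tf(position_lists, n):
        """position_lists[i]: sorted positions of the i-th argument in the doc."""
        ptrs = [0] * len(position_lists)
        tf = 0
        while ptrs[0] < len(position_lists[0]):
            prev = position_lists[0][ptrs[0]]
            matched = True
            for i in range(1, len(position_lists)):
                positions = position_lists[i]
                # Advance this argument's pointer past the previous argument's position.
                while ptrs[i] < len(positions) and positions[ptrs[i]] <= prev:
                    ptrs[i] += 1
                if ptrs[i] == len(positions):
                    return tf                      # this argument is exhausted
                if positions[ptrs[i]] - prev > n:  # too far apart; try the next
                    matched = False                # position of the first argument
                    break
                prev = positions[ptrs[i]]
            if matched:
                tf += 1
                ptrs = [p + 1 for p in ptrs]       # consume the matched positions
            else:
                ptrs[0] += 1
        return tf

    # "#NEAR/2(a b)" over the text "a x b a b": a occurs at 0 and 3, b at 2 and 4.
    print(near_tf([[0, 3], [2, 4]], 2))            # -> 2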

 

Testing

There are three ways to test your software.

  1. Run the tests on your computer and check results locally, using the files that were downloaded above (in the TEST_DIR directory). This is the fastest option.

    Mac, Linux, and Windows trec_eval executables are available. You can also download the trec_eval source from GitHub and compile your own. Run the command as shown below.

         trec_eval-9.0.4 -m num_q -m num_ret -m num_rel -m num_rel_ret -m map -m recip_rank -m P -m ndcg -m ndcg_cut -m recall.100,500,1000 cw09a.adhoc.1-200.qrel.indexed <test case>.teIn

    Store the output in a .teOut file; a scripted way to do this is sketched after this list. The reference results are in TEST_DIR; your own results are probably written to OUTPUT_DIR.

  2. Run the tests on your computer using the files that were downloaded above (in the TEST_DIR directory), save the results to a file in trec_eval format, and upload the file to the trec_eval web service. This is reasonably quick if you use a script to upload results to the service.

  3. Package your source code in a .zip file, and upload the file to the homework testing service. This is the slowest option, but we use this service to assess your software when we grade your homework, so we strongly suggest that you use it at least a couple of times before making your submission for grading.
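
The helper below is one way to script option 1: it runs the trec_eval command shown above and stores the output in a .teOut file. The executable name and the path to the .qrel file are assumptions (they depend on which version you downloaded and where you unpacked the relevance judgments); adjust them to match your INPUT_DIR.

    # One way to script option 1: run trec_eval and save its output as .teOut.
    # The executable and .qrel paths are assumptions; adjust them for your setup.

    import subprocess

    def run_trec_eval(tein_path, teout_path,
                      trec_eval="INPUT_DIR/trec_eval-9.0.4",
                      qrels="INPUT_DIR/cw09a.adhoc.1-200.qrel.indexed"):
        cmd = [trec_eval,
               "-m", "num_q", "-m", "num_ret", "-m", "num_rel", "-m", "num_rel_ret",
               "-m", "map", "-m", "recip_rank", "-m", "P",
               "-m", "ndcg", "-m", "ndcg_cut", "-m", "recall.100,500,1000",
               qrels, tein_path]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        with open(teout_path, "w") as f:
            f.write(result.stdout)

    run_trec_eval("OUTPUT_DIR/HW1-Train-0.teIn", "OUTPUT_DIR/HW1-Train-0.teOut")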

These web services are accessed via the HW1 testing web page.

 

Important Advice

Parameter files: You must write parameter files for your experiments. We strongly recommend that you write a script to generate .param files automatically, e.g., CSV file → set of .param files (each row is a parameter, each column is an experiment): create a Python dict for each column, convert it to json, and write it to a file. This is entirely optional, but you will be generating these files for five homework assignments, and doing it manually gets tedious and error prone.
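
A minimal sketch of the idea, using the BM25 parameter file from the Parameter Files section as a template; the (k_1, b) values being swept and the output locations are only examples.

    # A minimal sketch of generating .param files from Python, using the BM25
    # example from the Parameter Files section as a template.  The (k_1, b)
    # values and file locations below are only examples.

    import json

    def write_param(path, k_1, b):
        params = {
            "indexPath": "INPUT_DIR/index-cw09",
            "queryFilePath": "TEST_DIR/HW1-Train-2.qry",
            "task_1:ranker": {
                "type": "BM25",
                "outputLength": "1000",
                "BM25:k_1": str(k_1),
                "BM25:b": str(b),
            },
            "task_2:output": {
                "type": "trec_eval",
                "outputPath": path.replace(".param", ".teIn"),
                "outputLength": "1000",
            },
        }
        with open(path, "w") as f:
            json.dump(params, f, indent=4)

    # One .param file per parameter setting you want to test.
    for k_1, b in [(0.75, 0.3), (1.2, 0.5)]:
        write_param(f"OUTPUT_DIR/bm25-k{k_1}-b{b}.param", k_1, b)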

Producing tables: Each experiment produces several .teOut files. Manually copying results from multiple .teOut files to create tables for your reports is tedious and error prone. It is faster to use a script to merge .teOut files into a .csv file. A spreadsheet makes it easier for you to quickly recognize trends and create tables for your report. te2csv.py is simple software for this task; or, you may write your own software.

Automatic backups: Every semester, someone's laptop dies or is stolen, and they lose everything. Don't be that person. Best practice is to use an automatic backup service such as CrashPlan (used by CMU faculty and staff), Idrive, Backblaze, Arq, Acronis, or Carbonite. If something bad happens, your machine can be rebuilt in a few hours. If you don't have this already, now is a good time to start. It is a basic professional skill for Computer Scientists.

 

FAQ

If you have questions not answered here, see the HW1 FAQ.