Search Engines:
11-442 / 11-642 / 11-742
 
 

HW1: Boolean Retrieval
Due February 5, 11:59pm ET

 

Assignment Overview

The purpose of this assignment is to learn about Boolean retrieval and evaluation of retrieval results. This assignment consists of several parts.

A password is required to access some course resources. You may obtain a password here. Use your Andrew email address as your userid.

 

1. Development Environment

The homework assignments are done in Python using an Anaconda development environment with a standard set of libraries and tools. If you do not already have Anaconda on your computer, download and install it.

Next, define an Anaconda environment that contains the same libraries and tools that will be used to test and grade your software.

conda env create -f 11x42-24Sa.yml
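
After creating the environment, activate it before running or testing your software. The environment name is defined inside the yml file; the name used below is only an assumption based on the file name.

conda activate 11x42-24Sa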

The QryEval software and the HW1-Train-xxx.param parameter files assume the following directory structure.

Directory      Contents                                  Source
QryEval/       QryEval 4.2 and the subdirectories below  You create
GRADING_DIR/   Results from the reference system         Download zip or tgz
INPUT_DIR/     Data files, other software                Download index (see below);
                                                         download trec_eval (Mac (FAQ), Linux, Windows);
                                                         download .qrel file
LIB_DIR/       Java jar files                            Download zip or tgz
OUTPUT_DIR/    Output from your software                 You create (initially empty)
TEST_DIR/      Public ("training") test cases            Download zip or tgz

Note: Most of the .zip and .tgz files unpack into a directory that contains a set of files. Unzip each one into the QryEval directory, not into a subdirectory. For example, QryEval/TEST_DIR/TEST_DIR/HW1-Train-0.param or QryEval/QryEval/QryEval.py is a mistake. A correct configuration does not have two directories with the same name in a path.

 

2. New Retrieval Capabilities

In this assignment, you create a search engine called QryEval that implements Unranked and Ranked Boolean retrieval with several query operators. You may extend our initial QryEval search engine (less work) or develop your own from scratch (more work).

2.1. Search Engine Architecture

Your search engine must parse structured or unstructured queries, and use a document-at-a-time architecture (as discussed in Lectures 2, 3, and 4) to rank documents for each query. Search results are written to a file in trec_eval format. The sample QryEval software (QryEval-4.2.1.zip or QryEval-4.2.1.tgz) provides a working example.

Documents are stored in a Lucene index. Lucene is a widely used open-source search engine, used both on its own and as the foundation of Elasticsearch, Solr, and Pyserini. Lucene is written in Java, so your search engine is a combination of Python and Java software.

The example software includes a set of Python classes that make it easier to develop a document-at-a-time architecture and access Lucene indexes. See the documentation for details.
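
To make the architecture concrete, here is a self-contained sketch of document-at-a-time evaluation for a Ranked Boolean #AND of two terms. The in-memory posting lists and all names are illustrative assumptions, not the QryEval or Lucene API; the sketch only shows how cursors over inverted lists advance together so that each candidate document is scored once.

    # A self-contained sketch of document-at-a-time evaluation for Ranked
    # Boolean #AND.  The in-memory posting lists stand in for the Lucene
    # inverted lists that QryEval reads; all names here are illustrative,
    # not the QryEval or Lucene API.

    postings = {                             # term -> [(docid, tf), ...],
        "apple": [(1, 3), (4, 1), (7, 2)],   # sorted by internal docid
        "pie":   [(4, 2), (7, 5), (9, 1)],
    }

    def ranked_boolean_and(terms, postings):
        """Score each candidate document once, advancing all lists together."""
        lists = [postings[t] for t in terms]
        ptrs = [0] * len(lists)                      # one cursor per inverted list
        results = []
        while all(p < len(l) for p, l in zip(ptrs, lists)):
            docids = [l[p][0] for p, l in zip(ptrs, lists)]
            candidate = max(docids)                  # a doc matches #AND only if
            if all(d == candidate for d in docids):  # every list points at it
                score = min(l[p][1] for p, l in zip(ptrs, lists))  # MIN of the tfs
                results.append((candidate, score))
                ptrs = [p + 1 for p in ptrs]
            else:                                    # advance the lists that lag
                ptrs = [p + 1 if l[p][0] < candidate else p
                        for p, l in zip(ptrs, lists)]
        return sorted(results, key=lambda r: -r[1])  # score, descending; QryEval
                                                     # also breaks ties by docid

    print(ranked_boolean_and(["apple", "pie"], postings))   # [(7, 2), (4, 1)]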

2.2. Requirements

Your search engine must support the following capabilities.

  1. Ranking algorithms:

    1. Unranked Boolean retrieval: Every document that matches the query gets a score of 1. (Provided in QryEval.)
    2. Ranked Boolean retrieval: The score for matching a query term is its term frequency (tf) in the document.
  2. Query operators:

    1. OR: return a document if at least one of the query arguments occurs in the document. Use the MAX function to combine the scores from the query arguments. (Provided in QryEval.)
    2. AND: return a document if all of the query arguments occur in the document. Use the MIN function to combine the scores from the query arguments.
    3. NEAR/n: return a document if all of the query arguments occur in the document, in order, with no more than n-1 terms separating two adjacent terms. For example, #NEAR/2(a b c) matches "a b c", "a x b c", "a b x c", and "a x b x c", but not "a x x b c". The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean).
    4. SYN: return a document if it matches any of the query arguments. The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean). (Provided in QryEval.)

    Unstructured queries must use a default query operator. #AND is the default operator for Unranked and Ranked Boolean retrieval models.
    Note: After you implement the #AND operator, you will need to change the default operator for the UnrankedBoolean retrieval model from #OR to #AND.

  3. Document fields: Your software must support the four fields provided by the index. (Provided in QryEval.)

    1. url: Terms extracted from the document url.
    2. keywords: Terms that the author provided in the <meta name="Keywords" content="..."> HTML tag, if any.
    3. title: Terms that the author provided in the <title> HTML element, if any.
    4. body: Terms extracted from the document body. This is the main document content.

    The field name (if any) is specified in the query using a suffix-based syntax of the form 'term.field', as in 'apple.title'. If the query does not specify a field (e.g., if the user types 'apple'), your software should default to using the 'body' field.
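
    For illustration (a hypothetical query, not one from the provided query sets): with these defaults, the unstructured query 'apple pie' is interpreted as if it had been written #AND(apple.body pie.body), while a structured query may apply operators and fields explicitly, e.g., #AND(apple.title #NEAR/3(apple pie)).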

See the Design Guide for advice about how to implement these capabilities.
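
For example, here is one way the #NEAR/n position constraint might be checked for a single document. It assumes (hypothetically) that positions[i] is the sorted list of positions at which the i-th query argument occurs in the field being searched, and it counts matches greedily, consuming each matched position once; the exact counting convention to use is covered in the lectures and the Design Guide.

    # A sketch of checking #NEAR/n for one document.  positions[i] is assumed
    # to be the sorted list of positions of the i-th query argument in the
    # field being searched (a hypothetical interface, not the QryEval API).
    import bisect

    def count_near_matches(positions, n):
        """Greedy count of in-order matches with gaps of at most n positions."""
        ptrs = [0] * len(positions)
        matches = 0
        while ptrs[0] < len(positions[0]):
            prev = positions[0][ptrs[0]]             # candidate start position
            ok = True
            for k in range(1, len(positions)):
                plist = positions[k]
                i = bisect.bisect_right(plist, prev, ptrs[k])  # next pos > prev
                if i == len(plist):
                    return matches                   # argument k is exhausted
                if plist[i] - prev > n:              # more than n-1 terms between
                    ok = False
                    break
                ptrs[k] = i
                prev = plist[i]
            if ok:
                matches += 1                         # Ranked Boolean score; > 0
                ptrs = [p + 1 for p in ptrs]         # consume matched positions
            else:
                ptrs[0] += 1                         # try the next start position
        return matches

    # #NEAR/2(a b c): "a x b x c" matches once, "a x x b c" does not match.
    print(count_near_matches([[0], [2], [4]], 2))    # 1
    print(count_near_matches([[0], [3], [4]], 2))    # 0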

2.3. Input

Your search engine must accept a single command-line argument, which is a path to a parameter file in json format. The parameter file configures your software for testing.

Your software must support the following parameters.

In HW1, there are two output length parameters. ranker.outputLength sets the maximum number of documents returned by the ranker. trecEvalOutputLength sets the maximum number of documents written to the trec_eval file. These may seem redundant. The need for different parameters is more apparent in HW3-HW5, when the ranking pipeline becomes longer.

An example parameter file is provided. When we test your software, we will use parameter files in the same format. You will need to write your own parameter files to do your experiments.
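
The following is a minimal sketch of the required command-line and parameter-file handling. Only the parameter names mentioned in this handout appear; treating the dotted names as flat json keys is an assumption, so consult the provided example parameter file for the actual names and structure.

    # A minimal sketch of reading the json parameter file named on the
    # command line.  Only parameter names mentioned in this handout appear
    # below; whether dotted names are flat keys or nested objects follows
    # the provided example parameter file (flat keys are assumed here).
    import json
    import sys

    def main():
        if len(sys.argv) != 2:
            sys.exit("Usage: python QryEval.py <parameterFile>")
        with open(sys.argv[1]) as f:
            parameters = json.load(f)
        ranker_output_length = int(parameters["ranker.outputLength"])
        trec_eval_output_length = int(parameters["trecEvalOutputLength"])
        trec_eval_output_path = parameters["trecEvalOutputPath"]
        # ... open the index, run the queries, write results to the output path

    if __name__ == "__main__":
        main()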

2.4. Output

The search engine writes rankings to the trecEvalOutputPath file in a format that enables trec_eval to produce evaluation reports. The provided QryEval software already does this correctly. The information below is provided just to help you understand the format and contents.

Matching documents must be sorted by their scores (primary key, descending order) and their external document ids (secondary key, alphabetic order; used for tie scores).
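
For reference, each line of the file uses the standard TREC "run" format that trec_eval expects: a query id, the literal Q0, an external document id, a rank, a score, and a run identifier, separated by whitespace. The query id and document ids below are made up for illustration.

    156 Q0 clueweb09-en0011-95-23810 1 5.0 run-1
    156 Q0 clueweb09-en0003-40-10455 2 3.0 run-1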

When a query retrieves no documents, write a line with a non-existent document id (e.g., 'dummy' or 'Nonexistent_Docid') and a score of 0.

The last field of each line is a run identifier. It may be anything. You may find this convenient for labeling different experiments.

2.5. Testing

There are three ways to test your software.

  1. Download the test cases (zip or tgz) and evaluation software to your computer. Run the tests on your computer and check results locally. This is the fastest option.

    Mac, Linux, and Windows executables are available. You can also download the trec_eval software from GitHub and compile your own. Run the command as shown below.

         trec_eval-9.0.4 -m num_q -m num_ret -m num_rel -m num_rel_ret -m map -m recip_rank -m P -m ndcg -m ndcg_cut cw09a.adhoc.1-200.qrel.indexed <test case>.teIn

    Store the output in a .teOut file.

  2. Download the test cases (zip or tgz) to your computer. Run the tests on your computer, save the results to a file in trec_eval format, and upload the file to the trec_eval web service.

  3. Package your source code in a .zip file, and upload the file to the homework testing service. This is the slowest option, but we use this service to assess your software when we grade your homework, so we strongly suggest that you use it at least a couple of times before making your submission for grading.

These web services are accessed via the HW1 testing web page.

2.6. Data (Index)

The corpus is 552,682 documents from the ClueWeb09 dataset. Documents are stored in a Lucene index. The index is provided in two formats. Use whichever you are more comfortable with.

The index is large. Download it as soon as possible.

 

3. Experiments

You must do experiments with three query sets.

  1. Bag-of-words (BOW) with #AND: Use the query set exactly as it was provided.
  2. Bag-of-words (BOW) with #NEAR/3: Use the query set exactly as it was provided.
  3. Structured: Manually use the bag-of-words query set to create a set of structured queries. Your goal is to improve over the baseline #AND and #NEAR/3 queries. You may decide how and where to apply the query operators to improve results, but note that expectations differ by course number:

Undergraduates: Your goal is to get familiar with the two ranking algorithms and to get a little hands-on experience with forming structured queries and running reproducible experiments. Grading is based primarily on whether your queries demonstrate an understanding of the use of different query operators and document fields.

Graduate students: You have the same goals as the undergraduates, but you should also try to develop general strategies for forming queries that beat the unstructured baselines. Don't obsess over achieving high accuracy. Grading is based on the quality of your strategies, not the accuracy of your results. Some interesting hypotheses may not work. Generate-and-test can work, but it is a poor strategy unless one has infinite time.

Use trecEvalOutputLength=1000 for your experiments.

You must test each query set (BOW with AND, BOW with NEAR/3, Structured) with each retrieval algorithm (unranked Boolean, ranked Boolean), so you will have six sets of experimental results.

For each test (each combination of query set and retrieval algorithm) you must report the following information:

  1. MRR;
  2. Precision at 10, 20, and 30 documents;
  3. Mean average precision (MAP); and
  4. Running time (minutes:seconds).

These correspond to the recip_rank, P_10/P_20/P_30, and map measures reported by trec_eval.

Reproducibility: Your submission must include files in the QryEval directory that enable all of your experiments to be reproduced. They must follow a naming convention so that the homework testing service can find them. The naming convention is approximately <hwid>-Exp-<experiment id>.<filetype>, for example, HW1-Exp-3a.qry and HW1-Exp-3a.param. See the report template (below) for guidance about how to name the files for each experiment.

 

4. The Report

11-442 students must submit a brief report that contains a statement of collaboration and originality, their queries, and their experimental results. A template is provided in Microsoft Word and pdf formats.

11-642 and 11-742 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.

See the grading information document for information about how experiments and reports are graded.

 

5. Submission Instructions

Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW1-Report.pdf and place it in the same directory that contains your software (e.g., the directory that contains QryEval.py).

Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.

 

6. Grading

Your report is uploaded to software that automatically parses it into sections to improve grading speed. Parsing fails if your report does not have the expected structure. A few points will be deducted if your report does not follow the structure of the report template.

Undergraduates: 88% autograding of your software, 12% for the queries and experimental results.

Graduate students: 50% autograding of your software, 50% for the queries, experimental results, and analysis of experimental results. See the grading information document for information about how experiments and reports are graded.

 

FAQ

If you have questions not answered here, see the HW1 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on January 31, 2024

Jamie Callan