HW1: Lexical Retrieval
Due September 15, 11:59pm ET
Assignment Overview
The purpose of this assignment is to learn about Boolean retrieval and evaluation of retrieval results. This assignment consists of several parts.
- Set up an Anaconda software development environment with tools that you will need for several homework assignments.
- Add new retrieval capabilities to the QryEval search engine.
- Conduct experiments with your search engine.
- Write a report about your work.
- Upload your work to the course website for grading.
A password is required to access some course resources. You may obtain a password here. Use your Andrew email address as your userid.
1. Development Environment
The homework assignments are done in Python using an Anaconda development environment with a standard set of libraries and tools. If you do not already have Anaconda on your computer, download and install it.
Next, define an Anaconda environment that contains the same libraries and tools that will be used to test and grade your software.
conda env create -f 11x42-25F-a.yml
Initialize the software development directory. This can be done in one of two ways.
The simplest method is to download init_dir.zip or init_dir.tgz (4.4G compressed, 7.0G uncompressed). It creates and populates the following directory structure.
Or, you can create it manually, using the files in the Source column below.
| Directory | Contents | Source |
| --- | --- | --- |
| QryEval/ | QryEval software | .zip or .tgz |
| INPUT_DIR/ | Data files, other software | Document index (.zip or .tgz); relevance judgments (.qrel) |
| LIB_DIR/ | Java jar files | .zip or .tgz |
| OUTPUT_DIR/ | Output from your software | Initially empty |
| TEST_DIR/ | Public ("training") test cases | .zip or .tgz |

The QryEval software, the homework parameter files, and the homework testing system expect this directory structure. When your software is tested, the current working directory will be the QryEval directory.
Finally, download the trec_eval executable, and put it in the QryEval/INPUT_DIR directory. There are different versions for Mac (see the FAQ), Linux, and Windows.
At this point, you should be able to run one of the HW1 public ("training") cases.
conda activate 11x42-25F-a
The output should be the following.
-- Ranker: UnrankedBoolean --
There should also be a file HW1-Train-0.teIn in the OUTPUT_DIR directory.
2. New Retrieval Capabilities
This assignment adds several new capabilities to the QryEval search engine.
2.1. Search Engine Architecture
Your search engine must parse structured or unstructured queries, and use a document-at-a-time architecture (as discussed in Lectures 1, 2, and 3) to rank documents for each query. Search results are written to a file in trec_eval format. The sample QryEval software (QryEval-4.5.zip or QryEval-4.5.tgz) provides a working example.
Documents are stored in a Lucene index. Lucene is a widely-used open-source search engine, both on its own, and as the foundation of ElasticSearch, Solr, and Pyserini. Lucene is written in Java, so your search engine is a combination of Python and Java software.
The example software includes a set of Python classes that make it easier to develop a document-at-a-time architecture and access Lucene indexes. See the documentation for details.
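To make the document-at-a-time idea concrete, here is a minimal sketch of such a loop. Its inputs are hypothetical (per-argument posting lists of (docid, tf) pairs and a score-combining function); it is not the QryEval API, whose actual classes are described in the documentation.

import heapq

def rank_documents(query_args, score_fn, k=1000):
    """Illustrative document-at-a-time loop.  query_args is a list of posting
    lists, one per query argument; each posting list is a sorted list of
    (docid, tf) pairs.  score_fn combines the tf values of the arguments
    that match a document."""
    cursors = [0] * len(query_args)    # one cursor per posting list
    top_k = []                         # min-heap of (score, docid)

    while True:
        # The next candidate is the smallest docid under any cursor.
        candidates = [plist[c][0] for plist, c in zip(query_args, cursors)
                      if c < len(plist)]
        if not candidates:
            break
        docid = min(candidates)

        # Collect tf values from the arguments that match this docid,
        # and advance their cursors past it.
        tfs = []
        for i, plist in enumerate(query_args):
            c = cursors[i]
            if c < len(plist) and plist[c][0] == docid:
                tfs.append(plist[c][1])
                cursors[i] += 1

        # This scores any document that matches at least one argument
        # (OR-style).  An AND-style operator would also require
        # len(tfs) == len(query_args).
        score = score_fn(tfs)
        heapq.heappush(top_k, (score, docid))
        if len(top_k) > k:
            heapq.heappop(top_k)

    return sorted(top_k, reverse=True)   # highest scores first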
2.2. Requirements
Your search engine must support the following capabilities.
Retrieval models:
- Unranked Boolean retrieval: Every document that matches the query gets a score of 1. (Provided in QryEval.)
- Ranked Boolean retrieval: The score for matching a query term is its term frequency (tf) in the document.
- BM25 retrieval: The score for matching a query term is a tf.idf score.
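For the BM25 score, the following is a minimal sketch of one common term-score formulation, shown only as a reminder of its general shape; the exact variant you must implement (e.g., how idf is floored, whether a user weight appears) is the one defined in lecture. k_1 and b correspond to the BM25:k_1 and BM25:b parameters described in the Input section.

import math

def bm25_term_score(tf, df, doclen, avg_doclen, N, k_1, b):
    """Sketch of a common BM25 term score: an RSJ-style idf times a
    length-normalized tf weight.  N is the number of documents in the field,
    df the document frequency of the term, and doclen/avg_doclen the document
    and average document lengths for the field."""
    idf = max(0.0, math.log((N - df + 0.5) / (df + 0.5)))
    tf_weight = tf / (tf + k_1 * (1 - b + b * doclen / avg_doclen))
    return idf * tf_weight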
Query operators:
- OR: return a document if at least one of the query arguments occurs in the document. Use the MAX function to combine the scores from the query arguments. (Provided in QryEval.) Used by the Unranked Boolean and Ranked Boolean retrieval models.
- AND: return a document if all of the query arguments occur in the document. Use the MIN function to combine the scores from the query arguments. Used by the Unranked Boolean and Ranked Boolean retrieval models.
- NEAR/n: return a document if all of the query arguments occur in the document, in order, with no more than n-1 terms separating two adjacent terms. For example, #NEAR/2(a b c) matches "a b c", "a x b c", "a b x c", and "a x b x c", but not "a x x b c". The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean). Used by all retrieval models. (A sketch of one way to implement the matching appears after this list.)
- SYN: return a document if it matches any of the query arguments. The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean). (Provided in QryEval.) Used by all retrieval models.
- SUM: return a document if at least one of the query arguments occurs in the document. Use summation to combine the scores from the query arguments. Used by BM25.
- WSUM: return a document if at least one of the query arguments occurs in the document. Use weighted summation to combine the scores from the query arguments. Used by BM25.
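As noted in the NEAR/n item above, the matching condition can be checked with one pointer per argument over term positions. The sketch below is one reasonable way to count ordered matches within a distance of n, assuming each argument supplies a sorted list of its positions in the document; it is not the reference implementation, and consuming matched positions is only one possible counting convention.

import bisect

def near_n_matches(position_lists, n):
    """Count #NEAR/n matches in one document.  position_lists[i] is the sorted
    list of positions of the i-th query argument.  Arguments must occur in
    order, with adjacent matched positions at most n apart."""
    count = 0
    pointers = [0] * len(position_lists)

    while pointers[0] < len(position_lists[0]):
        prev = position_lists[0][pointers[0]]
        matched = True
        for i in range(1, len(position_lists)):
            positions = position_lists[i]
            # First position of argument i that is strictly after prev.
            j = bisect.bisect_right(positions, prev, lo=pointers[i])
            if j >= len(positions) or positions[j] - prev > n:
                matched = False
                break
            pointers[i] = j
            prev = positions[j]
        if matched:
            count += 1
            # Consume the matched positions so they are not reused.
            for i in range(len(position_lists)):
                pointers[i] += 1
        else:
            pointers[0] += 1
    return count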
Unstructured queries must use a default query operator. #AND is the default operator for the Unranked Boolean and Ranked Boolean retrieval models. #SUM is the default operator for the BM25 retrieval model. Note: After you implement the #AND operator, you will need to change the default operator for the Unranked Boolean retrieval model from #OR to #AND.
Document fields: Your software must support the five fields provided by the index. (Provided in QryEval.)
- url: Terms extracted from the document url.
- keywords: Terms that the author provided in the <meta name="Keywords" content="..."> HTML tag, if any.
- inlink: Terms that other documents use in hyperlinks to this document, if any.
- title: Terms that the author provided in the <title> HTML element, if any.
- body: Terms extracted from the document body. This is the main document content.
The field name (if any) is specified in the query using a suffix-based syntax of the form 'term.field', as in 'apple.title'. If the query does not specify a field (e.g., if the user types 'apple'), your software should default to using the 'body' field.
See the Design Guide for advice about how to implement these capabilities.
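Parsing the 'term.field' syntax can be as simple as splitting a token on its last '.', and treating the suffix as a field only when it names one of the five supported fields; everything else defaults to body. The sketch below is illustrative, not the QryEval query parser.

KNOWN_FIELDS = {"url", "keywords", "inlink", "title", "body"}

def split_term_field(token, default_field="body"):
    """Sketch: 'apple.title' -> ('apple', 'title'); a token with no recognized
    field suffix, such as 'apple', defaults to the body field."""
    term, dot, suffix = token.rpartition(".")
    if dot and suffix in KNOWN_FIELDS:
        return term, suffix
    return token, default_field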
2.3. Input
Your search engine must accept a single command-line argument that is a path to a parameter file in json format. The parameter file configures your software for testing. The Design Guide explains the format of parameter files.
Your software must support the following parameters.
- indexPath: The path to the Lucene index directory. Typically this will be something like "indexPath": "INPUT_DIR/index-cw09".
- queryFilePath: The path to the query file.
- task_<id1>:ranker: The value is a nested json object that specifies the ranking algorithm and its parameters.
- type: Either "UnrankedBoolean", "RankedBoolean", or "BM25".
- outputLength: The maximum number of documents per query that the ranker should return. If this parameter is not specified, default to 1000.
- BM25:k_1: Values are real numbers >= 0.0.
- BM25:b: Values are real numbers between 0.0 and 1.0.
- task_<id2>:output: The value is a nested json object that describes the type of output file to write.
- type: The value is "trec_eval" for HW1.
- outputPath: The path to the file where your software will write its search results.
- outputLength: The maximum number of documents per query to write to the trec_eval file. If this parameter is not specified, default to 1000.
In HW1, there are two outputLength parameters. The Ranker outputLength sets the maximum number of documents returned by the ranker. Output outputLength sets the maximum number of documents written to the trec_eval file. These may seem redundant. The need for different parameters is more apparent in later homework, when the ranking pipeline becomes longer.
See the HW1 Testing page for example parameter files.
2.4. Output
The search engine writes rankings in a format that enables trec_eval to produce evaluation reports. The provided QryEval software already does this correctly; the information below is provided just to help you understand the format and contents.
Matching documents must be sorted by their scores (primary key, descending order) and their external document ids (secondary key, alphabetic order; used for tie scores).
When a query retrieves no documents, write a line with a non-existent document id (e.g., 'dummy' or 'Nonexistent_Docid') and a score of 0.
The last field of each line is a run identifier. It may be anything. You may find this convenient for labeling different experiments.
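The sketch below shows the sort order and the standard trec_eval run-file line (query id, 'Q0', external document id, rank, score, run id). You do not need to write this yourself for HW1, since QryEval already does it; this is only to make the format and tie-breaking concrete.

def write_tein(results, run_id, path, max_docs=1000):
    """Sketch: results maps query_id -> list of (external_docid, score).
    Documents are sorted by score (descending), then external docid
    (ascending) to break ties."""
    with open(path, "w") as out:
        for qid, docs in results.items():
            ranked = sorted(docs, key=lambda d: (-d[1], d[0]))[:max_docs]
            if not ranked:
                # A query that retrieves nothing still gets one line.
                ranked = [("dummy", 0.0)]
            for rank, (docid, score) in enumerate(ranked, start=1):
                out.write(f"{qid} Q0 {docid} {rank} {score} {run_id}\n")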
2.5. Testing
There are three ways to test your software.
Run the tests on your computer and check results locally, using the files that were downloaded above (in the TEST_DIR directory). This is the fastest option.
Mac, Linux, and Windows trec_eval executables are available (see above), or you can download the trec_eval source from GitHub and compile your own. Run the command as shown below.
trec_eval-9.0.4 -m num_q -m num_ret -m num_rel -m num_rel_ret -m map -m recip_rank -m P -m ndcg -m ndcg_cut -m recall.100,500,1000 cw09a.adhoc.1-200.qrel.indexed <test case>.teIn
Store the output in a .teOut file. The reference results are in TEST_DIR; your results are probably written to OUTPUT_DIR.

Alternatively, run the tests on your computer using the files that were downloaded above (in the TEST_DIR directory), save the results to a file in trec_eval format, and upload the file to the trec_eval web service.
Package your source code in a .zip file, and upload the file to the homework testing service. This is the slowest option, but we use this service to assess your software when we grade your homework, so we strongly suggest that you use it at least a couple of times before making your submission for grading.
These web services are accessed via the HW1 testing web page.
2.6. Advice
Parameter files: You must write parameter files for your experiments, which can be tedious. We recommend that you write a script to generate .param files automatically, e.g., CSV file (each row is a parameter, each column is an experiment) → set of .param files. Create a Python dict for each column, convert it to json, and write to a file. This is entirely optional, but highly recommended. You will be generating these files for five homework assignments.
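As a sketch of that approach, the snippet below builds one experiment's settings as a Python dict and writes it with json.dump. The key names and values shown here (especially the task ids and the query file name) are illustrative only; copy the exact key syntax from the Design Guide and the example parameter files on the HW1 Testing page.

import json

def write_param_file(settings, path):
    """Sketch: write one experiment's settings as a .param file in json format."""
    with open(path, "w") as f:
        json.dump(settings, f, indent=2)

# Illustrative settings only; real key names must match the Design Guide.
example = {
    "indexPath": "INPUT_DIR/index-cw09",
    "queryFilePath": "INPUT_DIR/HW1-queries.qry",      # hypothetical file name
    "task_1:ranker": {"type": "BM25",                  # example task id and values
                      "BM25:k_1": 1.2, "BM25:b": 0.75,
                      "outputLength": 1000},
    "task_2:output": {"type": "trec_eval",
                      "outputPath": "OUTPUT_DIR/HW1-Exp-3.1a.teIn",
                      "outputLength": 1000},
}
write_param_file(example, "HW1-Exp-3.1a.param")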
Producing tables: Each experiment produces several .teOut files. Manually copying results from multiple .teOut files to create tables for your reports is tedious and error prone. It is faster to use a script to merge .teOut files into a .csv file. A spreadsheet makes it easier for you to quickly recognize trends and create tables for your report. te2csv.py is simple software for this task; or, you may write your own software.
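If you prefer to write your own merger, the core is just reading trec_eval's summary lines (measure, query id, value, separated by whitespace) and pivoting them into one column per run. A rough sketch under that assumption:

import csv
import glob
import os

def teout_to_csv(teout_glob, csv_path):
    """Sketch: merge .teOut files into one CSV with one row per metric and
    one column per run.  Assumes each trec_eval output line has three
    whitespace-separated fields: measure, query id, value."""
    table = {}      # metric -> {run: value}
    runs = []
    for path in sorted(glob.glob(teout_glob)):
        run = os.path.splitext(os.path.basename(path))[0]
        runs.append(run)
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 3:
                    continue
                metric, qid, value = parts
                if qid == "all":                 # keep only the summary rows
                    table.setdefault(metric, {})[run] = value
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric"] + runs)
        for metric, values in table.items():
            writer.writerow([metric] + [values.get(run, "") for run in runs])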
Automatic backups: Every semester, someone's laptop dies or is stolen, and they lose everything. Don't be that person. Best practice is to use an automatic backup service such as CrashPlan (used by CMU faculty and staff), Idrive, Backblaze, Arq, Acronis, or Carbonite. If something bad happens, your machine can be rebuilt in a few hours. If you don't have this already, now is a good time to start. It is a basic professional skill.
3. Experiments
After your system is developed, you must conduct experiments that investigate the effectiveness of the different retrieval models and document representations.
Conduct your experiments with this BOW query set. Some experiments require you to transform the bag-of-words queries into structured queries.
3.1. Experiment: Structure in Boolean Queries
The first experiment investigates the use of default query structure and other types of structure in Boolean queries. Compare retrieval accuracy using three query sets.
- Bag-of-words (BOW): Use the query set exactly as it was provided. (The retrieval model will supply its default query operator.)
- Bag-of-words (BOW) with #NEAR/3: Use a query set that uses the NEAR/3 query operator as the default query operator.
- Structured: Manually use the bag-of-words query set to create a set of structured queries for Boolean retrieval models. Your goal is to improve over the baseline #AND and #NEAR/3 queries. You may decide how and where to apply the query operators to improve results, but:
- Every query must contain some form of structure;
- Every query operator (AND, OR, NEAR, SYN) must be used in at least two queries; and
- Every field (title, body, url, keywords, inlink) must be used in at least two queries.
- FAQ:
- It isn't necessary to use all operators or fields in the same query.
- You are not restricted to terms in the original query. You may add synonyms or other vocabulary if you wish.
- You may delete terms that you think are not helpful.
Test each query set (BOW with AND, BOW with NEAR/3, Structured) with each retrieval algorithm (unranked Boolean, ranked Boolean); thus there will be 6 sets of experimental results.
Set the trec_eval output's outputLength parameter to 1000 for the experiments.
For each test (each query set), report the following information:
- MRR;
- Precision at 10, 20, and 30 documents;
- Mean average precision (MAP); and
- Running time (minutes:seconds).
Undergraduates: Your goal is to get familiar with the two ranking algorithms and to get a little hands-on experience with forming structured queries and running reproducible experiments. Grading is based primarily on whether your queries demonstrate an understanding of the use of different query operators and document fields.
Graduate students: You have the same goals as the undergraduates, but you should also try to develop general strategies for forming queries that beat the unstructured baselines. Don't obsess over achieving high accuracy. Grading is based on the quality of your strategies, not the accuracy of your results. Some interesting hypotheses may not work. Generate-and-test eventually works, but it is a poor strategy unless one has infinite time.
3.2. Experiment: Document Representations
The documents in your index have five representations: body, title, url (terms extracted from the url), keywords (terms supplied by the document author), and inlink (terms used in hyperlinks from other documents to this document). The second experiment examines the value of these different fields using the Ranked Boolean and BM25 retrieval models. Test each representation individually using the bag-of-words queries. Thus there will be 10 sets of experimental results (2 retrieval models × 5 document representations).
3.3. Experiment: Multifield BM25
BM25 allows evidence from different representations of the document content (e.g., body, title, ...) to be combined to create a better estimate of how well a concept (e.g., a query term) or a query matches the document. For example, the unstructured query "The Time Traveler's Wife" can be transformed into the following structured query.
#WSUM (
0.2 #SUM(time.url traveler.url wife.url)
0.3 #SUM(time.title traveler.title wife.title)
0.5 #SUM(time.body traveler.body wife.body))
Transform the BOW query set into structured queries that use #WSUM to combine evidence from different document representations (fields). Test four weight combinations for the different document representations (i.e., create four structured query sets). Note: Within each structured query set, use the same weights for all queries (i.e., do not form query-specific weights).
Each field (body, title, url, keywords, inlink) must be used (must have a non-zero weight) in at least one experiment.
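A small helper can generate these #WSUM queries mechanically from the BOW terms and a per-field weight, so testing four weight combinations only means changing one dict. The sketch below mirrors the example query above; the weights shown are placeholders, and the choice of fields and weights is yours.

def wsum_query(terms, field_weights):
    """Sketch: build a #WSUM structured query like the example above.
    terms is the list of BOW query terms; field_weights maps a field name
    to its weight (zero-weight fields are omitted)."""
    parts = []
    for field, weight in field_weights.items():
        if weight <= 0:
            continue
        scored_terms = " ".join(f"{term}.{field}" for term in terms)
        parts.append(f"{weight} #SUM({scored_terms})")
    return "#WSUM( " + " ".join(parts) + " )"

# Example, matching the query above:
#   wsum_query(["time", "traveler", "wife"],
#              {"url": 0.2, "title": 0.3, "body": 0.5})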
4. The Report
11-442 students must submit a brief report that contains a statement of collaboration and originality, their queries, and their experimental results. A template is provided in Microsoft Word and pdf formats.
11-642 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.
See the grading information document for information about how experiments and reports are graded.
5. Submission Instructions
Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW1-Report.pdf and place it in the same directory that contains your software (e.g., the directory that contains QryEval.py).
Reproducibility: Your submission must include files in the QryEval directory that enable all of your experiments to be reproduced. They must follow a naming convention so that the homework testing service can find them. The naming convention is approximately <hwid>-Exp-<experiment id>.<filetype>, for example, HW1-Exp-3.1a.qry and HW1-Exp-3.1a.param. See the report template (above) for guidance about how to name the files for each experiment.
Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.
6. Grading
Your report is uploaded to software that automatically parses it into sections to improve grading speed. Parsing fails if your report does not have the expected structure. A few points will be deducted if your report does not follow the structure of the report template.
Undergraduates: 88% autograding of your software, 12% for the queries and experimental results.
Graduate students: 50% autograding of your software, 50% for the queries, experimental results, and analysis of experimental results. See the grading information document for information about how experiments and reports are graded.
FAQ
If you have questions not answered here, see the HW1 FAQ and the Homework Testing FAQ.