HW1: Lexical Retrieval
Due September 15, 11:59pm ET
Assignment Overview
The purpose of this assignment is to learn about Boolean retrieval and evaluation of retrieval results. This assignment consists of several parts.
- Set up an Anaconda software development environment with tools that you will need for several homework assignments.
- Add new retrieval capabilities to the QryEval search engine.
- Conduct experiments with your search engine.
- Write a report about your work.
- Upload your work to the course website for grading.
A password is required to access some course resources. You may obtain a password here. Use your Andrew email address as your userid.
1. Development Environment
The homework assignments are done in Python using an Anaconda development environment with a standard set of libraries and tools. If you do not already have Anaconda on your computer, download and install it.
Next, define an Anaconda environment that contains the same libraries and tools that will be used to test and grade your software.
conda env create -f 11x42-25F-a.yml
Initialize the software development directory. This can be done in one of two ways.
The simplest method is to download init_dir.zip or init_dir.tgz (4.4G compressed, 7.0G uncompressed). It creates and populates the following directory structure.
Or, you can create it manually, using the files in the Source column below.
| Directory | Contents | Source |
| --- | --- | --- |
| QryEval/ | QryEval software | .zip or .tgz |
| INPUT_DIR/ | Data files, other software | Document index (.zip or .tgz); relevance judgments (.qrel) |
| LIB_DIR/ | Java jar files | .zip or .tgz |
| OUTPUT_DIR/ | Output from your software | Initially empty |
| TEST_DIR/ | Public ("training") test cases | .zip or .tgz |

The QryEval software, the homework parameter files, and the homework testing system expect this directory structure. When your software is tested, the current working directory will be the QryEval directory.
Finally, download the trec_eval executable, and put it in the QryEval/INPUT_DIR directory. There are different versions for Mac (see the FAQ), Linux, and Windows.
At this point, you should be able to run one of the HW1 public ("training") cases.
conda activate 11x42-25F-a
The output should be the following.
-- Ranker: UnrankedBoolean --
There should also be a file HW1-Train-0.teIn in the OUTPUT_DIR directory.
2. New Retrieval Capabilities
This assignment adds several new capabilities to the QryEval search engine.
2.1. Search Engine Architecture
Your search engine must parse structured or unstructured queries, and use a document-at-a-time architecture (as discussed in Lectures 1, 2, and 3) to rank documents for each query. Search results are written to a file in trec_eval format. The sample QryEval software (QryEval-4.5.zip or QryEval-4.5.tgz) provides a working example.
Documents are stored in a Lucene index. Lucene is a widely-used open-source search engine, both on its own, and as the foundation of ElasticSearch, Solr, and Pyserini. Lucene is written in Java, so your search engine is a combination of Python and Java software.
The example software includes a set of Python classes that make it easier to develop a document-at-a-time architecture and access Lucene indexes. See the documentation for details.
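To make the document-at-a-time idea concrete, here is a minimal sketch of such a loop. Its inputs are hypothetical (per-argument posting lists of (docid, tf) pairs and a score-combining function); it is not the QryEval API, whose actual classes are described in the documentation.

import heapq

def rank_documents(query_args, score_fn, k=1000):
    """Illustrative document-at-a-time loop.  query_args is a list of posting
    lists, one per query argument; each posting list is a sorted list of
    (docid, tf) pairs.  score_fn combines the tf values of the arguments
    that match a document."""
    cursors = [0] * len(query_args)    # one cursor per posting list
    top_k = []                         # min-heap of (score, docid)

    while True:
        # The next candidate is the smallest docid under any cursor.
        candidates = [plist[c][0] for plist, c in zip(query_args, cursors)
                      if c < len(plist)]
        if not candidates:
            break
        docid = min(candidates)

        # Collect tf values from the arguments that match this docid,
        # and advance their cursors past it.
        tfs = []
        for i, plist in enumerate(query_args):
            c = cursors[i]
            if c < len(plist) and plist[c][0] == docid:
                tfs.append(plist[c][1])
                cursors[i] += 1

        # This scores any document that matches at least one argument
        # (OR-style).  An AND-style operator would also require
        # len(tfs) == len(query_args).
        score = score_fn(tfs)
        heapq.heappush(top_k, (score, docid))
        if len(top_k) > k:
            heapq.heappop(top_k)

    return sorted(top_k, reverse=True)   # highest scores first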
2.2. Requirements
Your search engine must support the following capabilities.
Retrieval models:
- Unranked Boolean retrieval: Every document that matches the query gets a score of 1. (Provided in QryEval.)
- Ranked Boolean retrieval: The score for matching a query term is its term frequency (tf) in the document.
- BM25 retrieval: The score for matching a query term is a tf.idf score.
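For the BM25 score, the following is a minimal sketch of one common term-score formulation, shown only as a reminder of its general shape; the exact variant you must implement (e.g., how idf is floored, whether a user weight appears) is the one defined in lecture. k_1 and b correspond to the BM25:k_1 and BM25:b parameters described in the Input section.

import math

def bm25_term_score(tf, df, doclen, avg_doclen, N, k_1, b):
    """Sketch of a common BM25 term score: an RSJ-style idf times a
    length-normalized tf weight.  N is the number of documents in the field,
    df the document frequency of the term, and doclen/avg_doclen the document
    and average document lengths for the field."""
    idf = max(0.0, math.log((N - df + 0.5) / (df + 0.5)))
    tf_weight = tf / (tf + k_1 * (1 - b + b * doclen / avg_doclen))
    return idf * tf_weight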
Query operators:
- OR: return a document if at least one of the query arguments occurs in the document. Use the MAX function to combine the scores from the query arguments. (Provided in QryEval.) Used by the Unranked Boolean and Ranked Boolean retrieval models.
- AND: return a document if all of the query arguments occur in the document. Use the MIN function to combine the scores from the query arguments. Used by the Unranked Boolean and Ranked Boolean retrieval models.
- NEAR/n: return a document if all of the query arguments occur in the document, in order, with no more than n-1 terms separating two adjacent terms. For example, #NEAR/2(a b c) matches "a b c", "a x b c", "a b x c", and "a x b x c", but not "a x x b c". The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean). Used by all retrieval models. (A sketch of one way to implement the matching appears after this list.)
- SYN: return a document if it matches any of the query arguments. The score for a matching document is either 1 (Unranked Boolean) or the number of matches (Ranked Boolean). (Provided in QryEval.) Used by all retrieval models.
- SUM: return a document if at least one of the query arguments occurs in the document. Use summation to combine the scores from the query arguments. Used by BM25.
- WSUM: return a document if at least one of the query arguments occurs in the document. Use weighted summation to combine the scores from the query arguments. Used by BM25.
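As noted in the NEAR/n item above, the matching condition can be checked with one pointer per argument over term positions. The sketch below is one reasonable way to count ordered matches within a distance of n, assuming each argument supplies a sorted list of its positions in the document; it is not the reference implementation, and consuming matched positions is only one possible counting convention.

import bisect

def near_n_matches(position_lists, n):
    """Count #NEAR/n matches in one document.  position_lists[i] is the sorted
    list of positions of the i-th query argument.  Arguments must occur in
    order, with adjacent matched positions at most n apart."""
    count = 0
    pointers = [0] * len(position_lists)

    while pointers[0] < len(position_lists[0]):
        prev = position_lists[0][pointers[0]]
        matched = True
        for i in range(1, len(position_lists)):
            positions = position_lists[i]
            # First position of argument i that is strictly after prev.
            j = bisect.bisect_right(positions, prev, lo=pointers[i])
            if j >= len(positions) or positions[j] - prev > n:
                matched = False
                break
            pointers[i] = j
            prev = positions[j]
        if matched:
            count += 1
            # Consume the matched positions so they are not reused.
            for i in range(len(position_lists)):
                pointers[i] += 1
        else:
            pointers[0] += 1
    return count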
Unstructured queries must use a default query operator. #AND is the default operator for the Unranked Boolean and Ranked Boolean retrieval models. #SUM is the default operator for the BM25 retrieval model. Note: After you implement the #AND operator, you will need to change the default operator for the Unranked Boolean retrieval model from #OR to #AND.
Document fields: Your software must support the five fields provided by the index. (Provided in QryEval.)
- url: Terms extracted from the document url.
- keywords: Terms that the author provided in the <meta name="Keywords" content="..."> HTML tag, if any.
- inlink: Terms that other documents use in hyperlinks to this document, if any.
- title: Terms that the author provided in the <title> HTML element, if any.
- body: Terms extracted from the document body. This is the main document content.
The field name (if any) is specified in the query using a suffix-based syntax of the form 'term.field', as in 'apple.title'. If the query does not specify a field (e.g., if the user types 'apple'), your software should default to using the 'body' field.
See the Design Guide for advice about how to implement these capabilities.
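Parsing the 'term.field' syntax can be as simple as splitting a token on its last '.', and treating the suffix as a field only when it names one of the five supported fields; everything else defaults to body. The sketch below is illustrative, not the QryEval query parser.

KNOWN_FIELDS = {"url", "keywords", "inlink", "title", "body"}

def split_term_field(token, default_field="body"):
    """Sketch: 'apple.title' -> ('apple', 'title'); a token with no recognized
    field suffix, such as 'apple', defaults to the body field."""
    term, dot, suffix = token.rpartition(".")
    if dot and suffix in KNOWN_FIELDS:
        return term, suffix
    return token, default_field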
2.3. Input
Your search engine must accept a single command-line argument that is a path to a parameter file in json format. The parameter file configures your software for testing. The Design Guide explains the format of parameter files.
Your software must support the following parameters.
- indexPath: The path to the Lucene index directory. Typically this will be something like "indexPath": "INPUT_DIR/index-cw09".
- queryFilePath: The path to the query file.
- task_<id1>:ranker: The value is a nested json object that specifies the ranking algorithm and its parameters.
- type: Either "UnrankedBoolean", "RankedBoolean", or "BM25".
- outputLength: The maximum number of documents per query that the ranker should return. If this parameter is not specified, default to 1000.
- BM25:k_1: Values are real numbers >= 0.0.
- BM25:b: Values are real numbers between 0.0 and 1.0.
- task_<id2>:output: The value is a nested json object that describes the type of output file to write.
- type: The value is "trec_eval" for HW1.
- outputPath: The path to the file where your software will write its search results.
- outputLength: The maximum number of documents per query to write to the trec_eval file. If this parameter is not specified, default to 1000.
In HW1, there are two outputLength parameters. The Ranker outputLength sets the maximum number of documents returned by the ranker. Output outputLength sets the maximum number of documents written to the trec_eval file. These may seem redundant. The need for different parameters is more apparent in later homework, when the ranking pipeline becomes longer.
See the HW1 Testing page for example parameter files.
2.4. Output
The search engine writes rankings in a format that enables trec_eval to produce evaluation reports. The provided QryEval software already does this correctly; the information below is provided just to help you understand the format and contents.
Matching documents must be sorted by their scores (primary key, descending order) and their external document ids (secondary key, alphabetic order; used for tie scores).
When a query retrieves no documents, write a line with a non-existent document id (e.g., 'dummy' or 'Nonexistent_Docid') and a score of 0.
The last field of each line is a run identifier. It may be anything. You may find this convenient for labeling different experiments.
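The sketch below shows the sort order and the standard trec_eval run-file line (query id, 'Q0', external document id, rank, score, run id). You do not need to write this yourself for HW1, since QryEval already does it; this is only to make the format and tie-breaking concrete.

def write_tein(results, run_id, path, max_docs=1000):
    """Sketch: results maps query_id -> list of (external_docid, score).
    Documents are sorted by score (descending), then external docid
    (ascending) to break ties."""
    with open(path, "w") as out:
        for qid, docs in results.items():
            ranked = sorted(docs, key=lambda d: (-d[1], d[0]))[:max_docs]
            if not ranked:
                # A query that retrieves nothing still gets one line.
                ranked = [("dummy", 0.0)]
            for rank, (docid, score) in enumerate(ranked, start=1):
                out.write(f"{qid} Q0 {docid} {rank} {score} {run_id}\n")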
2.5. Testing
There are three ways to test your software.
Run the tests on your computer and check results locally, using the files that were downloaded above (in the TEST_DIR directory). This is the fastest option.
Mac, Linux, and Windows trec_eval executables are available (see above), or you can download the trec_eval source from GitHub and compile your own. Run the command as shown below.
trec_eval-9.0.4 -m num_q -m num_ret -m num_rel -m num_rel_ret -m map -m recip_rank -m P -m ndcg -m ndcg_cut -m recall.100,500,1000 cw09a.adhoc.1-200.qrel.indexed <test case>.teIn
Store the output in a .teOut file. The reference results are in TEST_DIR; your results are probably written to OUTPUT_DIR.

Alternatively, run the tests on your computer using the files that were downloaded above (in the TEST_DIR directory), save the results to a file in trec_eval format, and upload the file to the trec_eval web service.
Package your source code in a .zip file, and upload the file to the homework testing service. This is the slowest option, but we use this service to assess your software when we grade your homework, so we strongly suggest that you use it at least a couple of times before making your submission for grading.
These web services are accessed via the HW1 testing web page.
2.6. Advice
Parameter files: You must write parameter files for your experiments, which can be tedious. We recommend that you write a script to generate .param files automatically, e.g., CSV file (each row is a parameter, each column is an experiment) → set of .param files. Create a Python dict for each column, convert it to json, and write to a file. This is entirely optional, but highly recommended. You will be generating these files for five homework assignments.
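As a sketch of that approach, the snippet below builds one experiment's settings as a Python dict and writes it with json.dump. The key names and values shown here (especially the task ids and the query file name) are illustrative only; copy the exact key syntax from the Design Guide and the example parameter files on the HW1 Testing page.

import json

def write_param_file(settings, path):
    """Sketch: write one experiment's settings as a .param file in json format."""
    with open(path, "w") as f:
        json.dump(settings, f, indent=2)

# Illustrative settings only; real key names must match the Design Guide.
example = {
    "indexPath": "INPUT_DIR/index-cw09",
    "queryFilePath": "INPUT_DIR/HW1-queries.qry",      # hypothetical file name
    "task_1:ranker": {"type": "BM25",                  # example task id and values
                      "BM25:k_1": 1.2, "BM25:b": 0.75,
                      "outputLength": 1000},
    "task_2:output": {"type": "trec_eval",
                      "outputPath": "OUTPUT_DIR/HW1-Exp-3.1a.teIn",
                      "outputLength": 1000},
}
write_param_file(example, "HW1-Exp-3.1a.param")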
Producing tables: Each experiment produces several .teOut files. Manually copying results from multiple .teOut files to create tables for your reports is tedious and error prone. It is faster to use a script to merge .teOut files into a .csv file. A spreadsheet makes it easier for you to quickly recognize trends and create tables for your report. te2csv.py is simple software for this task; or, you may write your own software.
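If you prefer to write your own merger, the core is just reading trec_eval's summary lines (measure, query id, value, separated by whitespace) and pivoting them into one column per run. A rough sketch under that assumption:

import csv
import glob
import os

def teout_to_csv(teout_glob, csv_path):
    """Sketch: merge .teOut files into one CSV with one row per metric and
    one column per run.  Assumes each trec_eval output line has three
    whitespace-separated fields: measure, query id, value."""
    table = {}      # metric -> {run: value}
    runs = []
    for path in sorted(glob.glob(teout_glob)):
        run = os.path.splitext(os.path.basename(path))[0]
        runs.append(run)
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 3:
                    continue
                metric, qid, value = parts
                if qid == "all":                 # keep only the summary rows
                    table.setdefault(metric, {})[run] = value
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric"] + runs)
        for metric, values in table.items():
            writer.writerow([metric] + [values.get(run, "") for run in runs])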
Automatic backups: Every semester, someone's laptop dies or is stolen, and they lose everything. Don't be that person. Best practice is to use an automatic backup service such as CrashPlan (used by CMU faculty and staff), Idrive, Backblaze, Arq, Acronis, or Carbonite. If something bad happens, your machine can be rebuilt in a few hours. If you don't have this already, now is a good time to start. It is a basic professional skill.
3. Experiments
After your system is developed, you must conduct experiments that investigate the effectiveness of the different retrieval models and document representations.
Conduct your experiments with this BOW query set. Some experiments require you to transform the bag-of-words queries into structured queries.
3.1. Experiment: Structure in Boolean Queries
The first experiment investigates the use of default query structure and other types of structure in Boolean queries. Compare retrieval accuracy using three query sets.
- Bag-of-words (BOW): Use the query set exactly as it was provided. (The retrieval model will supply its default query operator.)
- Bag-of-words (BOW) with #NEAR/3: Use a query set that uses the NEAR/3 query operator as the default query operator.
- Structured: Manually use the bag-of-words query set to create a set of structured queries for Boolean retrieval models. Your goal is to improve over the baseline #AND and #NEAR/3 queries. You may decide how and where to apply the query operators to improve results, but:
- Every query must contain some form of structure;
- Every query operator (AND, OR, NEAR, SYN) must be used in at least two queries; and
- Every field (title, body, url, keywords, inlink) must be used in at least two queries.
- FAQ:
- It isn't necessary to use all operators or fields in the same query.
- You are not restricted to terms in the original query. You may add synonyms or other vocabulary if you wish.
- You may delete terms that you think are not helpful.
Test each query set (BOW with AND, BOW with NEAR/3, Structured) with each retrieval algorithm (unranked Boolean, ranked Boolean); thus there will be 6 sets of experimental results.
Set the trec_eval output's outputLength parameter to 1000 for the experiments.
For each test (each query set), report the following information:
- MRR;
- Precision at 10, 20, and 30 documents;
- Mean average precision (MAP); and
- Running time (minutes:seconds).
Undergraduates: Your goal is to get familiar with the two ranking algorithms and to get a little hands-on experience with forming structured queries and running reproducible experiments. Grading is based primarily on whether your queries demonstrate an understanding of the use of different query operators and document fields.
Graduate students: You have the same goals as the undergraduates, but you should also try to develop general strategies for forming queries that beat the unstructured baselines. Don't obsess over achieving high accuracy. Grading is based on the quality of your strategies, not the accuracy of your results. Some interesting hypotheses may not work. Generate-and-test eventually works, but it is a poor strategy unless one has infinite time.
3.2. Experiment: Document Representations
The documents in your index have five representations: body, title, url (terms extracted from the url), keywords (terms supplied by the document author), and inlink (terms used in hyperlinks from other documents to this document). The second experiment examines the value of these different fields using the Ranked Boolean and BM25 retrieval models. Test each representation individually using the bag-of-words queries. Thus there will be 10 sets of experimental results (2 retrieval models × 5 document representations).
3.3. Experiment: Multifield BM25
BM25 allows evidence from different representations of the document content (e.g., body, title, ...) to be combined to create a better estimate of how well a concept (e.g., a query term) or a query matches the document. For example, the unstructured query "The Time Traveler's Wife" can be transformed into the following structured query.
#WSUM (
0.2 #SUM(time.url traveler.url wife.url)
0.3 #SUM(time.title traveler.title wife.title)
0.5 #SUM(time.body traveler.body wife.body))
Transform the BOW query set into structured queries that use #WSUM to combine evidence from different document representations (fields). Test four weight combinations for the different document representations (i.e., create four structured query sets). Note: Within each structured query set, use the same weights for all queries (i.e., do not form query-specific weights).
Each field (body, title, url, keywords, inlink) must be used (must have a non-zero weight) in at least one experiment.
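A small helper can generate these #WSUM queries mechanically from the BOW terms and a per-field weight, so testing four weight combinations only means changing one dict. The sketch below mirrors the example query above; the weights shown are placeholders, and the choice of fields and weights is yours.

def wsum_query(terms, field_weights):
    """Sketch: build a #WSUM structured query like the example above.
    terms is the list of BOW query terms; field_weights maps a field name
    to its weight (zero-weight fields are omitted)."""
    parts = []
    for field, weight in field_weights.items():
        if weight <= 0:
            continue
        scored_terms = " ".join(f"{term}.{field}" for term in terms)
        parts.append(f"{weight} #SUM({scored_terms})")
    return "#WSUM( " + " ".join(parts) + " )"

# Example, matching the query above:
#   wsum_query(["time", "traveler", "wife"],
#              {"url": 0.2, "title": 0.3, "body": 0.5})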
4. The Report
11-442 students must submit a brief report that contains a statement of collaboration and originality, their queries, and their experimental results. A template is provided in Microsoft Word and pdf formats.
11-642 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.
See the grading information document for information about how experiments and reports are graded.
5. Submission Instructions
Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW1-Report.pdf and place it in the same directory that contains your software (e.g., the directory that contains QryEval.py).
Reproducibility: Your submission must include files in the QryEval directory that enable all of your experiments to be reproduced. They must follow a naming convention so that the homework testing service can find them. The naming convention is approximately <hwid>-Exp-<experiment id>.<filetype>, for example, HW1-Exp-3.1a.qry and HW1-Exp-3.1a.param. See the report template (above) for guidance about how to name the files for each experiment.
Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.
6. Grading
Your report is uploaded to software that automatically parses it into sections to improve grading speed. Parsing fails if your report does not have the expected structure. A few points will be deducted if your report does not follow the structure of the report template.
Undergraduates: 88% autograding of your software, 12% for the queries and experimental results.
Graduate students: 50% autograding of your software, 50% for the queries, experimental results, and analysis of experimental results. See the grading information document for information about how experiments and reports are graded.
FAQ
If you have questions not answered here, see the HW1 FAQ and the Homework Testing FAQ.