Search Engines:
11-442 / 11-642
 
 

HW2: Query Expansion
Due Sep 30, 11:59pm ET

 

Assignment Overview

The purpose of this assignment is to gain experience with query expansion algorithms and the TermVector index data structure. This assignment consists of several parts.

 

1. New Retrieval Capabilities

Ranking with query expansion requires your system to have two new capabilities: i) a reranking architecture, and ii) pseudo relevance feedback to create new queries.

1.1. Search Engine Architecture

The QryEval search engine supports a reranking architecture and provides an initial, unfinished Reranker class. HW2 requires you to develop a pseudo relevance feedback reranker that supports the following capabilities.

Use the Okapi pseudo relevance feedback algorithm.

Use the field specified by the prf:expansionField= parameter for pseudo relevance feedback. If the parameter is not specified, default to the body field.

Note: Your software must support two different sources for the initial document ranking: Your HW2 system, and the file specified by prf:initialRankingFile. This allows us to isolate the effects of the query expansion algorithm from how well your HW2 system worked.
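
For orientation only, here is a minimal sketch of how such a reranker might be organized: take the top documents from an initial ranking (produced by your HW2 system or read from prf:initialRankingFile), select expansion terms from the chosen expansion field, build an expanded query, and retrieve with it. All class and method names below are hypothetical and are not part of the provided QryEval code, and the Okapi term-scoring details are deliberately omitted; see the Design Guide for the required behavior.

    import java.util.List;

    // Hypothetical sketch only.  None of these names come from the provided
    // QryEval code, and the Okapi term-scoring details are omitted.
    public class PrfSketch {

      // A minimal (external docid, score) pair from a ranking.
      static class ScoredDoc {
        String externalId;
        double score;
        ScoredDoc(String externalId, double score) {
          this.externalId = externalId;
          this.score = score;
        }
      }

      // Rerank one query with pseudo relevance feedback.
      static List<ScoredDoc> rerank(
          String originalQuery,
          List<ScoredDoc> initialRanking,  // from your HW2 system or prf:initialRankingFile
          String expansionField,           // prf:expansionField, defaults to "body"
          int numFeedbackDocs,
          int numExpansionTerms) {

        // 1. Treat the top-ranked documents as pseudo-relevant.
        List<ScoredDoc> feedbackDocs =
            initialRanking.subList(0, Math.min(numFeedbackDocs, initialRanking.size()));

        // 2. Score candidate terms that occur in the expansion field of the
        //    feedback documents (the TermVector data structure supplies the
        //    term statistics) and keep the top numExpansionTerms of them.
        List<String> expansionTerms =
            selectExpansionTerms(feedbackDocs, expansionField, numExpansionTerms);

        // 3. Combine the original query with the expansion terms and run the
        //    expanded query to produce the final ranking.
        String expandedQuery = originalQuery + " " + String.join(" ", expansionTerms);
        return runExpandedQuery(expandedQuery);
      }

      // Placeholders for the parts that the assignment and Design Guide specify.
      static List<String> selectExpansionTerms(
          List<ScoredDoc> docs, String field, int n) {
        throw new UnsupportedOperationException("Okapi term selection goes here");
      }

      static List<ScoredDoc> runExpandedQuery(String query) {
        throw new UnsupportedOperationException("retrieval with the expanded query goes here");
      }
    }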

See the Design Guide for advice about how to implement these capabilities.

Your source code must run within our homework testing service.

1.2. Parameters

Your software must support all of the parameters used in previous homework, as well as the new parameters described below.
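
For illustration, a pseudo relevance feedback run might add lines such as the following to its .param file. Only these two parameter names are specified in this assignment; the prf:initialRankingFile value is shown as a placeholder, and any other prf: parameters your system needs are documented in the Design Guide.

    prf:expansionField=body
    prf:initialRankingFile=<path to an initial ranking file>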

1.3. Output

Your software must write search results to a file in trec_eval input format, as it did for previous homework. No change is required.
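
As a reminder, each line of a trec_eval input file describes one retrieved document for one query in the usual six-column format (queryID, Q0, externalDocumentID, rank, score, runID); the document id and run id in this example are made up.

    151 Q0 doc-0042 1 12.874000 run-1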

Your software must also write .qryOut files for grading. See the Design Guide for details.

1.4. Testing

You have the same three options for testing your software that you had for HW1.

The web services are accessed via the HW2 testing web page.

If you choose to do most of your testing locally on your own machine, use the following files.

2. Experiments

Conduct three experiments that investigate the effects of pseudo relevance feedback parameters on retrieval accuracy and efficiency, as described below.

win:tie:loss: You must report the number of queries that win, tie, or lose under query expansion, as measured using relative MAP difference with the original queries.
     (MAP_expanded - MAP_original) / MAP_original
A query wins if the relative difference ≥ 2.0% and loses if the relative difference ≤ -2.0%. Otherwise, it is a tie.
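
For example, with hypothetical values, a query whose original MAP is 0.2500 and whose expanded MAP is 0.2600 has a relative difference of (0.2600 - 0.2500) / 0.2500 = +4.0% and counts as a win; an expanded MAP of 0.2540 gives +1.6%, which counts as a tie.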

A new query set is used for HW2.

Experiment 1: Documents and Terms

Examine the effects of varying the number of feedback documents and query expansion terms on pseudo relevance feedback. Compare eight (num_docs, num_terms) combinations with an unstructured BM25 retrieval baseline.

Set BM25 parameters based on your experience in HW1. Use the same values of b and k_1 for all of the configurations. All experiments should return 1,000 documents.

Experiment 2: Initial Ranking Quality

Examine pseudo relevance feedback accuracy when different types of initial rankings are used. Select one of your configurations from Experiment 1 to serve as a baseline. Test four other initial rankings. One of the initial rankers must be Ranked Boolean, and one must be BM25 using title field queries. Select two other initial rankers that investigate your own ideas.

Experiment 3: Pseudo Relevance Feedback for Other Fields

The third experiment varies the prf:expansionField to investigate the value of using pseudo relevance feedback to form queries for fields other than body. Select one of your configurations from Experiment 1 or Experiment 2 to serve as a baseline. Develop four other configurations based on your experience in previous experiments. At least one configuration must produce queries for the title field.

Note: It is not necessary for the initial retrieval and the pseudo relevance feedback query to use the same field. For example, you could do retrieval using queries from field <x>, and use pseudo relevance feedback to form queries for field <y>.

2.4. Advice

As suggested for HW1, you may find it faster to use a script to generate .param files automatically.
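
As one possible approach, a small program can write one .param file per (num_docs, num_terms) combination. This is a sketch only: the combinations, file names, and the prf:numDocs and prf:numTerms parameter names below are placeholders, so substitute the parameter names, values, and other settings your system actually uses.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Sketch of a .param file generator for Experiment 1.
    // Parameter names, values, and combinations are placeholders.
    public class MakeParamFiles {
      public static void main(String[] args) throws IOException {
        int[][] combos = { {10, 10}, {10, 25}, {25, 10}, {25, 25},
                           {50, 10}, {50, 25}, {50, 50}, {100, 50} };  // your eight combinations
        for (int[] c : combos) {
          String contents =
              "prf:numDocs=" + c[0] + "\n"
            + "prf:numTerms=" + c[1] + "\n"
            + "prf:expansionField=body\n";   // plus your other parameters
          String name = "exp1-" + c[0] + "docs-" + c[1] + "terms.param";
          Files.write(Paths.get(name), contents.getBytes());
        }
      }
    }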

Each of the experiments may take several minutes to run. In our development, we use a script that runs several experiments in the background while we do other tasks. You may find it more pleasant to do something similar.

 

3. The Report

11-442 students must submit a brief report that contains a statement of collaboration and originality and their experimental results. A template is provided in Microsoft Word and pdf formats.

11-642 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and pdf formats. The report must follow the structure provided in the template.

See the grading information document for information about how experiments and reports are graded.

 

4. Submission Instructions

Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW2-Report.pdf and place it in the same zip file directory that contains your software (e.g., the directory that contains QryEval.java).

Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.

 

FAQ

If you have questions not answered here, see the HW2 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on September 26, 2024

Jamie Callan