Search Engines:
11-442 / 11-642 / 11-742
 
 

HW2: Ranked Retrieval and Structured Queries
Due Feb 19, 11:59pm ET

The purpose of this assignment is to gain experience with two popular ranked-retrieval algorithms. This assignment consists of several parts.

 

1. New Retrieval Capabilities

This homework extends the search engine that you developed for Homework 1 by adding two new retrieval methods: i) BM25, and ii) Indri's query likelihood model with two-level smoothing. It also adds some new query operators to your query language.

1.1. Search Engine Architecture

Extend the search architecture developed for HW1. The main differences are two new retrieval models, two new rankers, adjustments to the query parser, and other minor changes to get it all to work.

Your software must satisfy the same constraints that were described for Homework 1. If you are extending the software that you developed for Homework 1, your software probably already satisfies these constraints.

1.2. Requirements

Your search engine must support the following capabilities.

  1. Ranking algorithms:

    1. BM25
    2. Indri

    BM25 and Indri require access to corpus statistics, for example, the number of documents and the average length of a particular type of field. Use the Idx class to fetch this information from the index. (See the documentation for details.)

      // Number of documents in the corpus
      System.out.println ("numdocs=" + Idx.getNumDocs());
    
      // Information about the url field
      System.out.println ("url:\t" +
                          "numdocs=" +
                          Idx.getDocCount ("url") + "\t" +
    	  	      "sumTotalTF=" +
    		      Idx.getSumOfFieldLengths("url") + "\t" +
    		      "avglen="+
    		      Idx.getSumOfFieldLengths("url") /
    		      (float) Idx.getDocCount ("url"));
        

    The first example fetches the number of documents in the corpus.

    The second example fetches the number of documents that have "url" fields, the total number of term occurrences in all url fields, and the average length of a url field.

    BM25 and Indri also use the length of a field to normalize term frequency (tf). Use the Idx class to fetch this information from the index.

      System.out.println (Idx.getFieldLength("body", 21));
        

    This example shows how to find the length of the "body" field in document 21. Note that the document id is the internal doc id.
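    Putting these statistics together, the shape of a BM25 term score is sketched below (standard Okapi form). This is a sketch, not a required implementation: tf and df are assumed to have been read from the term's inverted list, and k1 and b are assumed to come from your parameter file.

      // Sketch of a BM25 term score (standard Okapi form).  tf and df
      // are assumed to come from the term's inverted list; Idx is the
      // course-provided index interface.
      public static double bm25TermScore (double tf, double df, int docid,
                                          String field, double k1, double b)
          throws IOException {
        double N = Idx.getNumDocs ();
        double avgLen = Idx.getSumOfFieldLengths (field) /
                        (double) Idx.getDocCount (field);
        double docLen = Idx.getFieldLength (field, docid);
        double rsj = Math.max (0.0, Math.log ((N - df + 0.5) / (df + 0.5)));
        double tfWeight = tf / (tf + k1 * ((1.0 - b) + b * (docLen / avgLen)));
        return rsj * tfWeight;
      }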

    Indri requires you to compute P(qi|C), the probability of query term qi in the corpus. Implement this value as P(qi|C) = ctf(qi) / length(Cfield), where ctf(qi) is the collection term frequency of qi (the total number of times qi occurs in the field, across all documents) and length(Cfield) is the total length of the 'field' part of the corpus (e.g., the length of the 'title' part of the corpus). Idx.getSumOfFieldLengths(field) provides that length.
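    For example, one way to combine P(qi|C) with document evidence under two-level smoothing (Dirichlet smoothing controlled by mu, followed by Jelinek-Mercer mixing controlled by lambda) is sketched below. Again, this is a sketch, not a required implementation: tf and ctf are assumed to have been read from the term's inverted list, and mu and lambda are assumed to come from your parameter file.

      // Sketch of an Indri query-likelihood term score with two-level
      // smoothing.  tf and ctf (collection term frequency) are assumed
      // to come from the term's inverted list.
      public static double indriTermScore (double tf, double ctf, int docid,
                                           String field,
                                           double mu, double lambda)
          throws IOException {
        double pMLE = ctf / (double) Idx.getSumOfFieldLengths (field); // P(qi|C)
        double docLen = Idx.getFieldLength (field, docid);
        double pDir = (tf + mu * pMLE) / (docLen + mu);  // Dirichlet smoothing
        return (1.0 - lambda) * pDir + lambda * pMLE;    // Jelinek-Mercer mixing
      }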

  2. Query Operators:

    The BM25 retrieval method must support BM25's implied SUM query operator, as well as the SYN, NEAR/n, and WINDOW/n query operators (i.e., all QryIop operators). The SUM query operator is the default query operator for unstructured (bag-of-words) queries in BM25.

    The Indri retrieval method must support the Indri AND, WAND, and WSUM query operators, as well as the SYN, NEAR/n, and WINDOW/n query operators. The AND query operator is Indri's default query operator for unstructured (bag-of-words) queries, but note that the score calculation will not be the same as your Ranked Boolean AND.

    The SYN and NEAR/n query operators from Homework 1 can be reused for this homework; their implementations are identical for all retrieval models (Ranked Boolean, BM25, and Indri).
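    For example, assuming the #OPERATOR(...) syntax from your HW1 query parser (the queries and weights below are only illustrative):

      #sum(low carb diet)                               (BM25 default)
      #and(low carb diet)                               (Indri default)
      #wand(0.7 #and(low carb diet) 0.3 #near/1(low carb))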

  3. Document Fields: Your software must continue to support the four fields provided in the index. This does not require any additional work by you.

See the Design Guide for advice about how to implement these capabilities.

1.3. Input

Your software must accept only one parameter, which is a path to a parameter file. Your software must support the same parameters that were described for Homework 1. In addition, it must support several new parameters, as described below.

1.4. Output

Your software must write search results in the same way that it did for Homework 1. The only difference in this homework is that your scores will be a little more interesting.
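As in Homework 1, each line of the output follows the standard trec_eval input format: query id, "Q0", external document id, rank, score, and a run tag. For example (the document id and run tag here are illustrative):

  57.2 Q0 clueweb09-en0000-01-23456 1 16.425799 run-1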

1.5. Testing

Use the HW2 Testing Page to access the trec_eval and homework testing services. You used similar services for Homework 1. Note: The link for these services is not the same as it was for Homework 1. The requirements for this homework are a little different, so the testing apparatus must also be a little different.

You may do local testing on your laptop, as you did for HW1. The HW2 test cases (zip, tgz) and grading files (zip, tgz) are available for download. Note: HW2 uses a new .qrel file that supports the more fine-grained query intents (e.g., 57.2) used in this assignment.

 

2. Experiments

You will conduct several experiments that compare different types of retrieval discussed in class. For each experiment, report the information described below.

Use trecEvalOutputLength=1000 for your experiments.

There are two query sets for this homework assignment: a set of short queries and a set of long queries.

Your experiments must be reproducible. See the HW1 reproducibility requirements for more information.

2.1. Experiment 1: Baselines (Everyone)

The first experiment uses short and long queries to compare your new retrieval models with your Ranked Boolean retrieval model. Its purpose is to help you understand the behavior of each model under baseline conditions.

2.2. Experiment 2: Sequential Dependency Model Queries (Everyone)

The Indri retrieval model is often used with sequential dependency model (SDM) queries. Develop SDM versions of both query sets. (We recommend writing a little script to generate them; a sketch appears after the FAQ note below.) Test four different combinations of weights for SDM queries.

FAQ: SDM experiments are done on the body field. SDM doesn't provide much value on short fields.
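For reference, a minimal sketch of such a script is shown below. It assumes a query file format of "qid:term1 term2 ..." (an assumption; adjust to match your query files). The #wand weights (0.7 / 0.2 / 0.1) and the WINDOW/8 size are illustrative, not required values.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  // Sketch: convert "qid:term1 term2 ..." queries into SDM queries.
  public class SdmQueryGenerator {
    public static void main (String[] args) throws IOException {
      try (BufferedReader in = new BufferedReader (new FileReader (args[0]))) {
        String line;
        while ((line = in.readLine ()) != null) {
          String[] parts = line.split (":", 2);
          String[] t = parts[1].trim ().split ("\\s+");
          if (t.length < 2) {                        // no bigrams possible
            System.out.println (parts[0] + ":#and(" + t[0] + ")");
            continue;
          }
          StringBuilder near = new StringBuilder ();
          StringBuilder win = new StringBuilder ();
          for (int i = 0; i < t.length - 1; i++) {   // adjacent term pairs
            near.append (" #near/1(" + t[i] + " " + t[i + 1] + ")");
            win.append (" #window/8(" + t[i] + " " + t[i + 1] + ")");
          }
          System.out.println (parts[0] + ":#wand(" +
              "0.7 #and(" + String.join (" ", t) + ")" +
              " 0.2 #and(" + near.toString ().trim () + ")" +
              " 0.1 #and(" + win.toString ().trim () + "))");
        }
      }
    }
  }

Running the script over each query file produces structured SDM queries that your search engine can consume unchanged.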

2.3. Experiment 3: Indri Parameter Adjustment (11-642)

The third experiment explores the effect of Indri's Dirichlet smoothing on short and long queries.

Test μ=1500 and four other values of μ. (Set λ=0.0.) Use your knowledge of how Dirichlet smoothing and Indri work to guide your choices. Use unstructured queries (the default query operator).

You do not need to explore the entire parameter space and you are not asked to find the optimal settings. However, you will need to explain and justify each choice, so give them some thought.

2.4. Experiment 4: Different Representations (11-642)

The last experiment examines how well each model does at integrating information from different fields. It is difficult to use the keywords and inlink fields effectively, so this experiment uses just the url, title, and body fields. Form queries as follows.

The url field is not used for Ranked Boolean because it is difficult to use effectively with an exact match model.

 

3. The Report

11-442 students must submit a brief report that contains a statement of collaboration and originality, and their experimental results. A template is provided in Microsoft Word and PDF formats. The report must follow the structure provided in the template.

11-642 and 11-742 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and PDF formats. The report must follow the structure provided in the template.

 

4. Submission Instructions

Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW2-Report.pdf and place it in the same zip file directory that contains your software (e.g., the directory that contains QryEval.py).

Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.

 

5. Grading

The grading requirements and advice are the same as for HW1.

 

FAQ

If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.


Copyright 2024, Carnegie Mellon University.
Updated on February 11, 2024

Jamie Callan