15-493 - Information Retrieval and Web Mining
Jamie Callan
Yiming Yang
Due: Tuesday, October 20, 2009

Homework #2

The purpose of this assignment is to gain experience with four different document ranking algorithms and two types of query operators.

This assignment consists of three major parts:

  1. Write a program that will evaluate queries and produce a ranking of documents;
  2. Do some experiments with different methods of ranking documents; and
  3. Write a report describing your program, the experiments, and your conclusions.
The report is an important part of your grade. Leave enough time to do a good job on it.

The Program

Your program should run a single retrieval experiment. A single retrieval experiment is a loop over a set of queries.

For each query:

  1. Read one query from the query file.
  2. Parse the query to create a corresponding query tree in which the internal nodes are query operators and the leaves are index terms. An example is shown below.
     

    Your query parser should handle the following cases:
    1. If a query has no explicit query operator, default to the typical choice for the retrieval algorithm;
    2. Discard terms that are stopwords; and
    3. Handle hyphenated terms (e.g., near-death) and other ambiguous forms of punctuation (use your own judgment about how to handle them).
  3. Evaluate the query, using a depth-first, term-at-a-time strategy. The evaluation of a leaf node should fetch the inverted list for the index term, if one is available. Note that some query terms may not occur in the index.
  4. Sort the matching documents by their scores, in descending order.
  5. Write the information for the best 100 documents retrieved to a file. Use trec_eval format (shown below).

You must develop and compare four different ranking algorithms (step 3, above).

  1. Vector Space: The vector space retrieval model, using LNC.LTC weighting.
  2. The query likelihood retrieval model, with two-stage smoothing.
  3. Indri: The query likelihood retrieval model, with Dirichlet priors.
  4. Custom: Your own ideas.
All four ranking algorithms must be capable of processing unstructured queries using the standard method of combining scores for that algorithm (summation for vector space, probabilistic AND for query likelihood and Indri).

Your program must support two additional query operators for two of the ranking algorithms.

  1. Vector Space:
    1. #OR, using the p-norm algorithm.
    2. #NEAR/n, using the algorithm implemented in HW1.
  2. Indri:
    1. #OR, using Indri's probabilistic OR algorithm.
    2. #NEAR/n, using the algorithm implemented in HW1.
The query language should use a prefix-style query language. For example, the OR operator is defined as "#OR( )"; NEAR operators should be defined similarly "#NEAR/n".

Your program can be written in C++, Java, or the language you used for HW1. If you wish to use a different programming language, ask first.

Your program should be well written and provide clear documentation within the source code.

Data

The corpus is stored in an index on the lemurproject.org server. The index is accessed using a web-based service that provides both a simple interactive search interface and a simple software API. Your program will use the software API for interacting with the index.

You need to know four things to use the server-based index.

  1. The web service is located at http://education.lemurproject.org/search/rcv1.
  2. The index is password protected. See the instructor to obtain the username and password.
  3. Examples of how to issue http GET requests with an embedded username and password from programs in several different programming languages are provided at http://education.lemurproject.org/SoftwareAccess.php.
  4. The web service commands that access the document index are described at http://education.lemurproject.org/cgiinterface.php. This document describes how to fetch inverted lists, term-specific corpus statistics, and similar information. Note that the document IDs are internal document IDs of the index.

The vector-space retrieval model uses vector lengths to normalize the dot product for the cosine correlation. You could compute these yourself, but that is time-consuming, so they are provided for you at http://education.lemurproject.org/search/rcv1/rcv1-cosine.tar.gz. The file contains one tab-delimited line per document with two columns: the document ID and the cosine correlation value.

document_id vector_length

Note: The web service tends to get busy and less responsive as the homework deadline approaches. As soon as your query parser works, it is a good idea to download and store the inverted lists for each query token. It will be much faster and more reliable for your software to use inverted lists stored on your own machine. If you adopt this strategy, store each inverted list in a separate file and name the files <token>.inv (e.g., "apple.inv") so that the TA can run your software.

Evaluating Results

The output of your program must enable the trec_eval program to produce evaluation reports. The output should be in the form of:

QueryID Q0 DocID Rank Score RunID
For example:
501 Q0 83653 1 0.69run-1
501 Q0 83858 2 0.67 run-1
501 Q0 83912 3 0.63 run-1
: : : : : :
502 Q0 85586 1 0.78 run-1

The QueryID should correspond to the query ID of the query you are evaluating. Q0 is a required constant. The DocID should be the internal document ID from the index. The scores should be in descending order, to indicate that your results are ranked. The Run ID is an experiment identifier, for your convenience. It can be anything.

Use the trec_eval program to evaluate your results.

Note that the trec_eval is extremely intolerant of formatting errors. If you receive errors or don't get the results that you expect, the mostly likely reason is that the format is (slightly) wrong.

The Experiment

You must conduct six tests of your program.

  1. Unstructured: Run the query set against each retrieval algorithm, using unstructured queries (4 tests).
  2. Structured: Run the query set against the vector space and Indri algorithms, using structured queries (2 tests).

For each test (each query set) you must report the following information:

  1. Precision at 10, 20, and 30 documents;
  2. Mean average precision (MAP); and
  3. "wall clock" running time.

The Report and Source Code

You must turn in a written report, in Ascii text (txt), Microsoft Word, or pdf format. Include the following in your report, or in a separate file as indicated below. Use clear section headings and file names:

  1. The custom ranking algorithm: Describe the Custom algorithm that you developed in sufficient detail that someone else could implement it. Describe what it was intended to do.
  2. In a separate TEXT file - Sample unstructured results: For each retrieval method (4 methods), list the rank, document id, and score (one tuple per line) for the first 50 documents returned for unstructured versions of queries R101 and R102.
  3. Experimental results for unstructured queries: Present these in a tabular form so that it is easy to compare the results for each algorithm. Discuss your observations about the various algorithms, i.e., differences in how they performed, what worked well and didn't, patterns/trends you observed across the set of experiments, etc. Pay special attention to the differences between different types of retrieval models, differences between different types of smoothing methods, and the effectiveness of your custom algorithm.
  4. Structured queries: List your structured queries. For each query, provide a one sentence comment about your intent, i.e., why you thought that particular structure was a good choice.
  5. In a separate TEXT file - Sample structured results: For each retrieval method (2 methods), list the rank, document id, and score (one tuple per line) for the best 50 documents returned for the following structured queries.
    1. R101: #OR (economic espionage)
    2. R102: #OR (convicts #NEAR/2 (repeat offenders))
  6. Experimental results for structured queries: Present these in a tabular form so that it is easy to compare the results for each algorithm. Discuss your observations about the various algorithms, i.e., differences in how they performed, what worked well and didn't, patterns/trends you observed across the set of experiments, types of queries for which structure worked well, etc.
  7. The software implementation:
    1. Describe your design decisions and high-level software architecture.
    2. Describe major data structures (if any).
    3. Briefly discuss any programming tools or libraries that you used.
    4. Discuss the strengths and weaknesses of your design, and any problems that your system encountered. Now that you have done it, what did you do well, and what should you have done differently?

You must also turn in your source code, packaged as a .zip, .gz, or .tar file. Do NOT list your source code in the report. The instructor will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class lesson - the instructor will actually care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it.

Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.

FAQ

If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.


Copyright 2008, Carnegie Mellon University.
Updated on September 23, 2008.
Jamie Callan