15-493 - Information Retrieval and Web Mining
Jamie Callan
Yiming Yang
Due: Friday, September 18, 2009

Homework #1

The purpose of this assignment is to learn about unranked Boolean retrieval and evaluation of unranked results.

This assignment consists of three major parts:

  1. Write a program that will evaluate queries and perform a retrieval process;
  2. Perform evaluation on the results of the retrieval process; and
  3. Write a report describing your program, the evaluation and your conclusions.
The report is an important part of your grade. Leave enough time to do a good job on it.

The Program

Your program should run a single retrieval experiment. A single retrieval experiment is a loop over a set of queries.

For each query:

  1. Read one query from the query file.
  2. Parse the query to create a corresponding query tree in which the internal nodes are query operators and the leaves are index terms. An example is shown below.
     

    Your query parser should handle the following cases:
    1. If a query has no explicit query operator, default to OR;
    2. Discard terms that are stopwords; and
    3. Handle hyphenated terms (e.g., near-death) and other ambiguous forms of punctuation (use your own judgment about how to handle them).
  3. Evaluate the query, using a depth-first, term-at-a-time strategy. The evaluation of a leaf node should fetch the inverted list for the index term, if one is available. Note that some query terms may not occur in the index.
  4. Write the information for the first 100 documents retrieved to a file. Use trec_eval format (shown below). If more than 100 documents match the query (which will happen often), list the last 100 documents (these will usually be the most recent).

Your program should support three query operators:

  1. OR - return a document if at least one of the query arguments occurs in the document;
  2. AND - return a document if all of the query arguments occur in the document; and
  3. NEAR/n - return a document if all of the query arguments occur in the document, in order, with no more than n-1 terms separating two adjacent terms. For example, #NEAR/2(a b c) matches "a b c", "a x b c", "a b x c", and "a x b x c", but not "a x x b c".
The query language should use a prefix-style query language. For example, the OR operator is defined as "#OR( )"; the AND and NEAR operators should be defined similarly (i.e. "#AND" and "#NEAR/n" ).

Your program can be written in C++ or Java. If you wish to use a different programming language, ask first.

Your program should be well written and provide clear documentation within the source code.

Data

The corpus is stored in an index on the lemurproject.org server. The index is accessed using a web-based service that provides both a simple interactive search interface and a simple software API. Your program will use the software API for interacting with the index.

You need to know four things to use the server-based index.

  1. The web service is located at http://education.lemurproject.org/search/rcv1.
  2. The index is password protected. See the instructor to obtain the username and password.
  3. Examples of how to issue http GET requests with an embedded username and password from programs in several different programming languages are provided at http://education.lemurproject.org/SoftwareAccess.php.
  4. The web service commands that access the document index are described at http://education.lemurproject.org/cgiinterface.php. This document describes how to fetch inverted lists, term-specific corpus statistics, and similar information. Note that the document IDs are internal document IDs of the index. Make sure you replace sample with rcv1 in the URLs so that you are using the correct corpus index!

Note: The web service tends to get busy and less responsive as the homework deadline approaches. As soon as your query parser works, you should download and store the inverted lists for each query token. It will be much faster and more reliable for your software to use inverted lists stored on your own machine. Store each inverted list in a separate file and name the files <token>.inv (e.g., "apple.inv") so that the TA can run your software.

Evaluating Results

The output of your program must enable the trec_eval program to produce evaluation reports. The output should be in the form of:

QueryID Q0 DocID Rank Score RunID
For example:
501 Q0 83653 1 1.0run-1
501 Q0 83858 2 1.0 run-1
501 Q0 83912 3 1.0 run-1
: : : : : :
502 Q0 85586 1 1.0 run-1

The QueryID should correspond to the query ID of the query you are evaluating. Q0 is a required constant. The DocID should be the internal document ID from the index. The scores should all be equal, to indicate that the results are unranked. The Run ID is an experiment identifier, for your convenience. It can be anything.

Use the trec_eval program to evaluate your results.

Note that the trec_eval is extremely intolerant of formatting errors. If you receive errors or don't get the results that you expect, the mostly likely reason is that the format is (slightly) wrong. See the FAQ for more information.

The Experiment

You must conduct two tests of your program.

  1. Baseline: Run the query set exactly as it was provided, i.e., using only the default #OR operator for each query. This is your baseline run.
  2. Structured: Create a set of structured queries that uses the three query operators (#OR, #AND, #NEAR/n). Your goal is to improve over the baseline #OR queries. You may decide how and where to apply the query operators to improve results, but you must use each query operator in at least 20 queries.

Although the goal for the Structured set is to beat the Baseline, don't obsess over achieving high accuracy. Unranked Boolean is not very accurate, so the results won't be great. However, do be sure that your program produces the correct results for each query set.

For each test (each query set) you must report the following information:

  1. Precision at 10, 20, and 30 documents;
  2. Mean average precision (MAP); and
  3. "wall clock" running time.

The Report and Source Code

You must turn in a written report, in ASCII text (txt), Microsoft Word, or pdf format. Your report must contain the following sections, each clearly labeled as an independent section.

  1. Query parsing choices: Describe your decisions about how natural language queries should be represented in a structured query format, for example, what delimited tokens, and how ambiguous punctuation (dashes, slashes, etc) was handled. This section should not be a description of software architecture. For most people it is probably brief.
  2. Structured queries: List your structured queries. For each query, provide a one sentence comment about your intent, i.e., why you thought that particular structure was a good choice.
  3. Sample results: List the document ids (one per line) for the first 50 documents (ordered by increasing document IDs) returned for the following queries:
    1. R101: #NEAR/2 (economic espionage)
    2. R102: #OR (convicts #AND (repeat offenders))
  4. Experimental results: Present the complete set of experimental results. Include the precision and running time results described above.
  5. Analysis of results: Discuss your observations about unranked Boolean retrieval. Discuss the effectiveness, strengths, and weaknesses of the query operators, and your success and failure at using them in queries. Feel free to include other comments about what you observed. This is your chance to show what you learned from this homework assignment - take this section very seriously.
  6. The software implementation:
    1. Describe your design decisions and high-level software architecture.
    2. Describe major data structures (if any).
    3. Briefly discuss any programming tools or libraries that you used.
    4. Discuss the strengths and weaknesses of your design, and any problems that your system encountered. Now that you have done it, what did you do well, and what should you have done differently?

You must also turn in your source code, packaged as a .zip, .gz, or .tar file. The instructor will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class - the instructors care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it. You must include a file called runner.sh in your zipped source-code. This script file will contain the one-line command required to run your code on the given query set. e.g. if your C++ executable is called boolSearchEngine.exe, and arg1, arg2 etc are the required arguments for your search engine, then the runner.sh file will contain just one line of the form "boolSearchEngine.exe arg1 arg2 .... argN". Since the evaluation server will be different from your development server, you must also include a file called compiler.sh which will compile all the required code in your archive. This file should again contain just a one-line command necessary to compile your source-code. You should test your submission, by copying your archive to a system different from your development server, and running compiler.sh, followed by runner.sh. If it runs without glitches, then TA will also be able to evaluate it without special requirements.

Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.

FAQ

If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.


Copyright 2009, Carnegie Mellon University.
Updated on September 10, 2009.
Jamie Callan