11-741 - Information Retrieval and Web Mining
Jamie Callan, Yiming Yang
Due: Tuesday, March 3, 2009
The purpose of this assignment is to gain experience with four different document ranking algorithms and two types of query operators.
This assignment consists of three major parts:
Your program should run a single retrieval experiment. A single retrieval experiment is a loop over a set of queries.
For each query:

You must develop and compare four different ranking algorithms (step 3, above).
Your program must support two additional query operators for two of the ranking algorithms.
With these operators, you should be able to directly use the structured queries you created for HW1 to test the Indri and Vector Space ranking methods in this homework. See if these structured queries perform any better than the other ranking methods with unstructured queries.
Your program can be written in C++, Java, or the language you used for HW1. If you wish to use a different programming language, ask first.
Your program should be well written and provide clear documentation within the source code.
The corpus is stored in an index on the lemurproject.org server. The index is accessed using a web-based service that provides both a simple interactive search interface and a simple software API. Your program will use the software API for interacting with the index.
You need to know four things to use the server-based index.
The vector-space retrieval model uses vector lengths to normalize the dot product for the cosine correlation. Computing these lengths yourself is time-consuming, so they are provided for you at http://education.lemurproject.org/search/rcv1/rcv1-cosine.tar.gz. The file contains one tab-delimited line per document with two columns: the document ID and the document's vector length.
| document_id | vector_length |
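As an illustration of how the vector length file might be used, the sketch below parses tab-delimited lines into a lookup table and normalizes a dot product by the query and document vector lengths. It assumes the file has no header row and that document IDs are kept as strings; the function names `parse_vector_lengths` and `cosine_score` are hypothetical and are not part of the course software.

```python
def parse_vector_lengths(lines):
    """Build a {document_id: vector_length} map from tab-delimited lines.

    Assumption: each non-empty line is "<document_id>\t<vector_length>"
    and the file has no header row.
    """
    lengths = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, length = line.split("\t")
        lengths[doc_id] = float(length)
    return lengths


def cosine_score(dot_product, query_length, doc_length):
    """Normalize a query-document dot product by both vector lengths."""
    if query_length == 0 or doc_length == 0:
        return 0.0  # avoid division by zero for empty vectors
    return dot_product / (query_length * doc_length)
```

In practice the lines would come from the extracted file, for example `parse_vector_lengths(open(path, encoding="utf-8"))`, where `path` is wherever you unpacked the archive.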
Note: The web service tends to get busy and less responsive as the homework deadline approaches. As soon as your query parser works, it is a good idea to download and store the inverted lists for each query token. It will be much faster and more reliable for your software to use inverted lists stored on your own machine. If you adopt this strategy, store each inverted list in a separate file and name the files <token>.inv (e.g., "apple.inv") so that the TA can run your software.
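One possible way to organize such a local cache is sketched below. The inverted list is treated as opaque text, because the server's response format is not described here. The `fetch` argument is a hypothetical callable that you would write to download one inverted list from the web service, and the function names are illustrative only.

```python
from pathlib import Path


def inverted_list_path(token, cache_dir="."):
    """Return the cache file for a token, named <token>.inv."""
    return Path(cache_dir) / f"{token}.inv"


def load_inverted_list(token, fetch, cache_dir="."):
    """Return the inverted list text for a token, using the local copy if present.

    `fetch` is a hypothetical callable (token -> str) that downloads the
    list from the server; it is called only when no cached file exists.
    """
    path = inverted_list_path(token, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(token)
    path.write_text(text, encoding="utf-8")
    return text
```

With this layout, the first request for "apple" writes `apple.inv`, and later runs read that file without contacting the server.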
The output of your program must enable the trec_eval program to produce evaluation reports. The output should be in the form of:
| QueryID | Q0 | DocID | Rank | Score | RunID |
| 501 | Q0 | 83653 | 1 | 0.69 | run-1 |
| 501 | Q0 | 83858 | 2 | 0.67 | run-1 |
| 501 | Q0 | 83912 | 3 | 0.63 | run-1 |
| : | : | : | : | : | : |
| 502 | Q0 | 85586 | 1 | 0.78 | run-1 |
The QueryID should correspond to the query ID of the query you are evaluating. Q0 is a required constant. The DocID should be the internal document ID from the index. Within each query, the results should be listed in descending order of score, to indicate that they are ranked. The RunID is an experiment identifier, for your convenience. It can be anything.
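A minimal sketch of producing these lines for one query follows. It assumes the six fields are separated by single spaces, which is the whitespace-separated layout trec_eval conventionally reads. The function name `format_run` and the four-decimal score format are illustrative choices, not requirements of the assignment.

```python
def format_run(query_id, scored_docs, run_id="run-1"):
    """Format one query's results as trec_eval-style lines.

    scored_docs: iterable of (doc_id, score) pairs in any order.
    Results are sorted by descending score and ranked from 1.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    lines = []
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_id}")
    return lines
```

Sorting inside the function keeps the Rank and Score columns consistent with each other even if the scores were accumulated in an arbitrary order.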
Use the trec_eval program to evaluate your results.
Note that trec_eval is extremely intolerant of formatting errors. If you receive errors or don't get the results that you expect, the most likely reason is that the format is (slightly) wrong.
You must conduct six tests of your program.
For each test (each query set) you must report the following information:
You must turn in a written report in ASCII text (txt), Microsoft Word, or PDF format. Include the following in your report, or in a separate file as indicated below. Use clear section headings and file names:
You must also turn in your source code, packaged as a .zip, .gz, or .tar file. Do NOT list your source code in the report. The instructor will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class, so the instructor will actually care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it.
Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.
If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.