| 11-741 - Information Retrieval Jamie Callan Yiming Yang |
Due: Thursday Feb 12, 2009 |
The purpose of this assignment is to learn about unranked Boolean retrieval and evaluation of unranked results.
This assignment consists of three major parts:
Your program should run a single retrieval experiment. A single retrieval experiment is a loop over a set of queries.
For each query:

Your program should support three query operators:
Your program can be written in C++ or Java. If you wish to use a different programming language, ask first.
Your program should be well written and provide clear documentation within the source code.
Note that the search engine developed in HW1 is extended in HW2 to rank documents. In HW1, each document either matches or does not match each query term. In HW2, each document that matches a query term gets a score based on information in the inverted list, and information about the corpus as a whole (e.g., average document length). HW2 also requires you to experiment with several different approaches to calculating scores. Keep this in mind when you build your software for HW1. Try to develop general and flexible software so that it is easy to extend for HW2. You are not graded on this; it is just intended as helpful advice.
The corpus is stored in an index on the lemurproject.org server. The index is accessed using a web-based service that provides both a simple interactive search interface and a simple software API. Your program will use the software API for interacting with the index.
You need to know four things to use the server-based index.
Note: The web service tends to get busy and less responsive as the homework deadline approaches. As soon as your query parser works, it is a good idea to download and store the inverted lists for each query token. It will be much faster and more reliable for your software to use inverted lists stored on your own machine. If you adopt this strategy, you can store each inverted list in a separate file and name the files <token>.inv (e.g., "apple.inv") and make sure to pack these files for submission as well, so that the TA can run your software.
The output of your program must enable the trec_eval program to produce evaluation reports. The output should be in the form of:
| QueryID | Q0 | DocID | Rank | Score | RunID |
| 501 | Q0 | 83653 | 1 | 1.0 | run-1 |
| 501 | Q0 | 83858 | 2 | 1.0 | run-1 |
| 501 | Q0 | 83912 | 3 | 1.0 | run-1 |
| : | : | : | : | : | : |
| 502 | Q0 | 85586 | 1 | 1.0 | run-1 |
The QueryID should correspond to the query ID of the query you are evaluating. Q0 is a required constant. The DocID should be the internal document ID from the index. The scores should all be equal, to indicate that the results are unranked. The Run ID is an experiment identifier, for your convenience. It can be anything.
Use the trec_eval program to evaluate your results.
Note that the trec_eval is extremely intolerant of formatting errors. If you receive errors or don't get the results that you expect, the mostly likely reason is that the format is (slightly) wrong.
You must conduct two tests of your program.
Baseline: Run the query set exactly as it was provided, i.e., using only the default #OR operator for each query. This is your baseline run.
Structured: Create a set of structured queries that uses the three query operators (#OR, #AND, #NEAR/n). Your goal is to improve over the baseline #OR queries. You may decide how and where to apply the query operators to improve results, but you must use each query operator in at least 20 queries. The book notes ".. AND operators tends to produce high precision .. while using OR operators gives low precision .. and it is difficult or impossible to find a satisfactory middle ground". In your report, you might want to comment on whether you had a similar experience in your queries.
Although the goal for the Structured set is to beat the Baseline, don't obsess over achieving high accuracy. However, do be sure that your program produces the correct results for each query set.
For each test (each query set) you must report the following information:
You must turn in a written report, in Ascii text (txt), Microsoft Word, or pdf format. Your report must contain the following sections, each clearly labeled as an independent section.
You must also turn in your source code, packaged as a .zip, .gz, or .tar file. The instructor will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class lesson - the instructor will actually care about your source code. The instructor will also run your code, so make sure that you include everything necessary to run it.
Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.
If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.