15-493 - Information Retrieval and Web Mining
Jamie Callan, Yiming Yang
Due: Friday, September 18, 2009
The purpose of this assignment is to learn about unranked Boolean retrieval and evaluation of unranked results.
This assignment consists of three major parts:
Your program should run a single retrieval experiment. A single retrieval experiment is a loop over a set of queries.
For each query:

Your program should support three query operators:
Your program can be written in C++ or Java. If you wish to use a different programming language, ask first.
Your program should be well written and provide clear documentation within the source code.
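As a rough illustration of what unranked Boolean operators do with inverted lists, the sketch below combines two lists of document IDs. It assumes that each inverted list is available as a sorted list of unique integer document IDs, and it shows only AND and OR; those two operator names and the class name BooleanOps are assumptions made for this example, not part of the assignment specification.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative Boolean operators over sorted lists of unique document IDs. */
final class BooleanOps {

    /** AND: keep only the document IDs that appear in both lists. */
    static List<Integer> and(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) {          // document matches both terms
                out.add(x);
                i++;
                j++;
            } else if (x < y) {    // advance whichever list is behind
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    /** OR: merge both lists, emitting each document ID once. */
    static List<Integer> or(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j >= b.size() || (i < a.size() && a.get(i) < b.get(j))) {
                out.add(a.get(i++));   // only a remains, or a is smaller
            } else if (i >= a.size() || b.get(j) < a.get(i)) {
                out.add(b.get(j++));   // only b remains, or b is smaller
            } else {
                out.add(a.get(i));     // same ID in both lists
                i++;
                j++;
            }
        }
        return out;
    }
}
```

Because both inputs are sorted, each operator makes a single pass over the two lists, and the result is again sorted, so operators can be nested.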
The corpus is stored in an index on the lemurproject.org server. The index is accessed using a web-based service that provides both a simple interactive search interface and a simple software API. Your program will use the software API for interacting with the index.
You need to know four things to use the server-based index.
Note: The web service tends to get busy and less responsive as the homework deadline approaches. As soon as your query parser works, you should download and store the inverted lists for each query token. It will be much faster and more reliable for your software to use inverted lists stored on your own machine. Store each inverted list in a separate file and name the files <token>.inv (e.g., "apple.inv") so that the TA can run your software.
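One way to organize the local copies is a small cache that maps each token to its <token>.inv file. The sketch below is only an illustration: the class name InvertedListCache is hypothetical, and the file layout (one document ID per line) is an assumption. In practice you would store whatever the web service returns for the token, in whatever form your parser expects.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical on-disk cache: one file per token, named <token>.inv. */
final class InvertedListCache {
    private final Path dir;

    InvertedListCache(Path dir) {
        this.dir = dir;
    }

    /** Path of the file that holds the inverted list for a token. */
    Path fileFor(String token) {
        return dir.resolve(token + ".inv");
    }

    /** True if the list was already downloaded and stored. */
    boolean isCached(String token) {
        return Files.exists(fileFor(token));
    }

    /** Store a list; ASSUMED format is one document ID per line. */
    void save(String token, List<Integer> docIds) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int id : docIds) {
            lines.add(Integer.toString(id));
        }
        Files.write(fileFor(token), lines, StandardCharsets.UTF_8);
    }

    /** Read a stored list back, skipping blank lines. */
    List<Integer> load(String token) throws IOException {
        List<Integer> docIds = new ArrayList<>();
        for (String line : Files.readAllLines(fileFor(token), StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                docIds.add(Integer.parseInt(trimmed));
            }
        }
        return docIds;
    }
}
```

With a cache like this, your program contacts the web service only when isCached returns false, so repeated experiments do not depend on the server being responsive.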
The output of your program must enable the trec_eval program to produce evaluation reports. The output should be in the form of:
| QueryID | Q0 | DocID | Rank | Score | RunID |
| 501 | Q0 | 83653 | 1 | 1.0 | run-1 |
| 501 | Q0 | 83858 | 2 | 1.0 | run-1 |
| 501 | Q0 | 83912 | 3 | 1.0 | run-1 |
| : | : | : | : | : | : |
| 502 | Q0 | 85586 | 1 | 1.0 | run-1 |
The QueryID should correspond to the query ID of the query you are evaluating. Q0 is a required constant. The DocID should be the internal document ID from the index. The scores should all be equal, to indicate that the results are unranked. The RunID is an experiment identifier for your convenience; it can be anything.
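The sketch below shows one way to produce these lines. It assumes the six fields are written on one line separated by single spaces, since trec_eval reads whitespace-separated columns and the vertical bars in the table above are layout only. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that formats unranked results for trec_eval. */
final class TrecOutput {

    /** One result line: QueryID Q0 DocID Rank Score RunID. */
    static String formatLine(String queryId, int docId, int rank, String runId) {
        // The score is the constant 1.0 because the results are unranked.
        return queryId + " Q0 " + docId + " " + rank + " 1.0 " + runId;
    }

    /** All result lines for one query; ranks count up from 1. */
    static List<String> formatResults(String queryId, List<Integer> docIds, String runId) {
        List<String> lines = new ArrayList<>();
        int rank = 1;
        for (int docId : docIds) {
            lines.add(formatLine(queryId, docId, rank, runId));
            rank++;
        }
        return lines;
    }
}
```

Plain string concatenation is used here instead of locale-sensitive number formatting, so the score is always written as 1.0.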
Use the trec_eval program to evaluate your results.
Note that trec_eval is extremely intolerant of formatting errors. If you receive errors or don't get the results that you expect, the most likely reason is that the format is (slightly) wrong. See the FAQ for more information.
You must conduct two tests of your program.
Although the goal for the Structured set is to beat the Baseline, don't obsess over achieving high accuracy. Unranked Boolean is not very accurate, so the results won't be great. However, do be sure that your program produces the correct results for each query set.
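If you want a quick sanity check on the numbers for a single query, set-based precision and recall are easy to compute by hand. The sketch below is an illustration only: the class name SetMetrics is hypothetical, it assumes you have the retrieved and relevant document IDs as sets, and it is not a substitute for the figures produced by trec_eval.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical set-based measures for one query's unranked result set. */
final class SetMetrics {

    /** Number of retrieved documents that are also relevant. */
    static int relevantRetrieved(Set<Integer> retrieved, Set<Integer> relevant) {
        Set<Integer> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    /** Fraction of retrieved documents that are relevant (0 if nothing retrieved). */
    static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / retrieved.size();
    }

    /** Fraction of relevant documents that were retrieved (0 if none are relevant). */
    static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) relevantRetrieved(retrieved, relevant) / relevant.size();
    }
}
```

For example, if a query retrieves documents {1, 2, 3, 4} and the relevant documents are {2, 4, 6}, two of the four retrieved documents are relevant, so precision is 2/4 and recall is 2/3.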
For each test (each query set) you must report the following information:
You must turn in a written report in ASCII text (.txt), Microsoft Word, or PDF format. Your report must contain the following sections, each clearly labeled as an independent section.
You must also turn in your source code, packaged as a .zip, .gz, or .tar file. The instructor will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class, and the instructors care about your source code.
The instructor will also run your code, so include everything necessary to run it. Your archive must contain a file called runner.sh that holds the one-line command required to run your code on the given query set. For example, if your C++ executable is called boolSearchEngine.exe, and arg1, arg2, etc. are the arguments your search engine requires, then runner.sh contains just one line of the form "boolSearchEngine.exe arg1 arg2 .... argN".
Because the evaluation server will be different from your development server, your archive must also contain a file called compiler.sh that compiles all of the code in the archive. This file should likewise contain just the one-line command needed to compile your source code. Test your submission by copying your archive to a system other than your development server and running compiler.sh followed by runner.sh. If both run without glitches, the TA will be able to evaluate your submission without special arrangements.
Please make it easy for the instructor to see how you have addressed each of the requirements described for each section.
If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.