| 15-482/11-682 - Human Language Technologies Jamie Callan, Alan Black, and Alon Lavie |
Due: Tuesday, Oct 3, 2006 |
This assignment will give you experience with several representations and retrieval algorithms for Web documents. The documents have been stored in a set of inverted indexes that you can access via the Web, so you don't need to create or manage a large inverted index.
This assignment consists of three major parts:
The program runs a single retrieval experiment. A single retrieval experiment consists of producing retrieval results for a set of queries. A rough outline of the program is a loop over the following steps:
QueryID Q0 DocID Rank Score RunIDFor example:
q1 Q0 83653 1 0.430352 tfidf-BothNote that the document IDs must be Lemur internal IDs. Lemur internal IDs are what you see in the inverted lists, not what you see in the text of the document when you use the interactive interface.
q1 Q0 83858 2 0.430207 tfidf-Both
q1 Q0 83912 3 0.429673 tfidf-Both
: : : : : : : : : : : :
q2 Q0 8558610.469673tfidf-Both
Your program must implement two different ranking algorithms (step 4, above):
The index supports three representations: The ordinary body text representation, an anchor text representation, and a URL representation. For example, the anchor text inverted list for the word "volcano" indicates how often the word "volcano" occured in document X, and also how often it occurred in the anchortext of hyperlinks that point to document X. For example, the url text inverted list for the word "volcano" indicates how often the word "volcano" occured in document X, and also how often it occurred in the URL for document X. You will also be provided with a file containing the URL for each document, which is helpful for determining URL depth.
Your program must be able to rank documents using four different combinations of representations:
The document index is available at http://education.lemurproject.org/search/wt10g/lemur.cgi. The index is password-protected. For this class the username is 11-682 and the password is wt10gpswd.
The index is accessed via the lemur.cgi CGI program. The CGI program provides both an interactive search interface and an API for access by software. When accessing interactively, you can either i) wait for the browswer to prompt you for the username and password, or ii) you can embed the username and password in the URL, for example, http://11-682:wt10gpswd@education.lemurproject.org/search/wt10g/lemur.cgi (this does not work with Internet Explorer). When accessing the index via software, the procedure is slightly more complex, but not much. See the index access documentation for examples in perl and Java.
See the CGI interface documentation for information about the command language supported by the CGI program, and the types of data that can be obtained.Some of the information you need is available in the index, but is more easily or efficiently accessed via a bulk lookup. This information is provided in files that you download once, and store locally.
In case you are curious, the data is a collection of 1,692,096 Web documents crawled by the Internet Archive during 1997. Each query is a short description of a more complex (but unknown) information need. The information needs are documented here, in case you are curious about them, but your program is not permitted to use this information.
You will have access to one index. Stopwords remain in the index, and terms were not stemmed. You may perform query-based stopword removal if you wish, using this stopword list.
As mentioned above, your program must implement two ranking algorithms, using four combinations of representations. You must test your program on three query files - homepage queries, informational queries, and combined queries - to explore how each configuration works on different kinds of queries.
You must test each combination of algorithm, representation, and query file, so you will do a total of 24 retrieval experiments (2 algorithms x 4 representations x 3 query files = 24 experiments). For each experiment you must report the following information:
Precision at 10, 20, 30 documents, AvgPrec, MRR, and "wall clock" running time.
Your report must have two sections. The first section must describe the experiment. The second section must describe the software.
The section describing the experiment must contain at least the following information:
The section describing the software must contain at least the following information:
This homework requires you to use a shared resource - the Web servers that provide access to the document index. You are expected to demonstrate good manners and professional conduct in your use of these resources.
Do not issue multiple http requests in parallel. The servers are capable of handling a normal volume of requests. However, when students begin issuing multiple requests in parallel, the servers become overwhelmed, response times become very slow, and everyone suffers. Anyone violating this policy risks having their IP address blocked and points deducted from their homework grade.
If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.