15-482/11-682 - Human Language Technologies
Jamie Callan, Alan Black, and Alon Lavie
Due: Tuesday, Oct 3, 2006

HOMEWORK 1

This assignment will give you experience with several representations and retrieval algorithms for Web documents. The documents have been stored in a set of inverted indexes that you can access via the Web, so you don't need to create or manage a large inverted index.

This assignment consists of three major parts:

Each part of the assignment will receive roughly equal credit during grading. This places a lot of emphasis on the report, so plan accordingly.

The Program

The program runs a single retrieval experiment. A single retrieval experiment consists of producing retrieval results for a set of queries. A rough outline of the program is a loop over the following steps:

  1. Read a query from a query file;
  2. Parse the query (however you want) into a set of tokens;
  3. Get the inverted list for each token from the index, using the http protocol over the Web.
  4. Use the information from the term inverted lists and other resources to rank the documents; and
  5. Write to a log file the rank, document id, and score of the top 30 documents, using TREC format (shown below).
    QueryID Q0 DocID Rank Score RunID
    For example:
    q1 Q0 83653 1 0.430352 tfidf-Both
    q1 Q0 83858 2 0.430207 tfidf-Both
    q1 Q0 83912 3 0.429673 tfidf-Both
        :    :    :     :    :    :     :    :    :     :    :    :
    q2 Q0 8558610.469673tfidf-Both
    Note that the document IDs must be Lemur internal IDs. Lemur internal IDs are what you see in the inverted lists, not what you see in the text of the document when you use the interactive interface.
After all the queries in the query set are processed, determine Precision/Recall values by running trec_eval on the log file. Use this Web form to upload your log file. The Precision/Recall results will be returned between the <body> and </body> tags of a simple Web page. See the trec_eval documentation for an explanation of the metrics it returns.

Your program must implement two different ranking algorithms (step 4, above):

  1. TF.IDF: This algorithm considers how often a term occurs in the document and how how ofen it occurs in the collection. You could use the standard LNC.LTC weighting covered in class, for example, or you could use the language modeling approach.
  2. Custom: Your own ideas; anything you want (as long as you document it).

The index supports three representations: The ordinary body text representation, an anchor text representation, and a URL representation. For example, the anchor text inverted list for the word "volcano" indicates how often the word "volcano" occured in document X, and also how often it occurred in the anchortext of hyperlinks that point to document X. For example, the url text inverted list for the word "volcano" indicates how often the word "volcano" occured in document X, and also how often it occurred in the URL for document X. You will also be provided with a file containing the URL for each document, which is helpful for determining URL depth.

Your program must be able to rank documents using four different combinations of representations:

  1. Body text,
  2. Anchor text,
  3. Body text + anchor text, and
  4. Body text + anchor text + URL text + URL depth.

The Index

The document index is available at http://education.lemurproject.org/search/wt10g/lemur.cgi. The index is password-protected. For this class the username is 11-682 and the password is wt10gpswd.

The index is accessed via the lemur.cgi CGI program. The CGI program provides both an interactive search interface and an API for access by software. When accessing interactively, you can either i) wait for the browswer to prompt you for the username and password, or ii) you can embed the username and password in the URL, for example, http://11-682:wt10gpswd@education.lemurproject.org/search/wt10g/lemur.cgi (this does not work with Internet Explorer). When accessing the index via software, the procedure is slightly more complex, but not much. See the index access documentation for examples in perl and Java.

See the CGI interface documentation for information about the command language supported by the CGI program, and the types of data that can be obtained.

Some of the information you need is available in the index, but is more easily or efficiently accessed via a bulk lookup. This information is provided in files that you download once, and store locally.

In case you are curious, the data is a collection of 1,692,096 Web documents crawled by the Internet Archive during 1997. Each query is a short description of a more complex (but unknown) information need. The information needs are documented here, in case you are curious about them, but your program is not permitted to use this information.

The Experiments

You will have access to one index. Stopwords remain in the index, and terms were not stemmed. You may perform query-based stopword removal if you wish, using this stopword list.

As mentioned above, your program must implement two ranking algorithms, using four combinations of representations. You must test your program on three query files - homepage queries, informational queries, and combined queries - to explore how each configuration works on different kinds of queries.

You must test each combination of algorithm, representation, and query file, so you will do a total of 24 retrieval experiments (2 algorithms x 4 representations x 3 query files = 24 experiments). For each experiment you must report the following information:

Precision at 10, 20, 30 documents, AvgPrec, MRR, and "wall clock" running time.

The Report

Your report must have two sections. The first section must describe the experiment. The second section must describe the software.

The section describing the experiment must contain at least the following information:

  1. The Precision, AvgPrec, and running time statistics mentioned above;
  2. A description of the Custom algorithm that you developed, described in sufficient detail that someone else could implement it;
  3. Your comments about any patterns/trends you observed across the set of experiments, for example, about which algorithms or representations were effective; and
  4. Your comments about what worked well, what did not not, and what you learned from this assignment.

The section describing the software must contain at least the following information:

  1. a description of your design decisions and high-level software architecture;
  2. a description of major data structures (if any);
  3. any programming tools or libraries that you used;
  4. strengths and weaknesses of your design, and any problems that your system encountered; and
  5. software source code.

Good Manners and Professional Conduct

This homework requires you to use a shared resource - the Web servers that provide access to the document index. You are expected to demonstrate good manners and professional conduct in your use of these resources.

Do not issue multiple http requests in parallel. The servers are capable of handling a normal volume of requests. However, when students begin issuing multiple requests in parallel, the servers become overwhelmed, response times become very slow, and everyone suffers. Anyone violating this policy risks having their IP address blocked and points deducted from their homework grade.

Restrictions

  1. Your system must do this task entirely automatically. No manual intervention is allowed.
  2. You may use C, C++, Java, or perl to write your program. If you want to use a different language, check with the instructor first.
  3. You must write all of the software yourself.

FAQ

If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.


Copyright 2006, Carnegie Mellon University.
Updated on September 19, 2006.
http://www.cs.cmu.edu/~callan