HW1 Frequently Asked Questions
Index
Test Cases and Grading
Efficiency
Miscellaneous
Index
Question: What is an md5 checksum and how do I create one?
Answer: md5 checksums are used to verify that a file has been downloaded correctly, without corruption.
- Linux: Use the md5sum command.
- Windows: Open the command prompt. cd to the desired folder. Type "certutil -hashfile <filename> MD5".
- Mac: Use the md5 command from the terminal.
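If you prefer to compute the checksum from a script, the minimal sketch below uses Python's standard hashlib module and produces the same value as the commands above; the file name is a placeholder.
import hashlib

def md5_checksum(path, chunk_size=8192):
    # Read the file in chunks so large files do not have to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

print(md5_checksum('yourfile.zip'))   # replace with the file you want to verify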
Question: Is there a way to know more about what is in the index? It might help us debug our software.
Answer: The QryEval zip file contains a simple InspectIndex utility that we use to examine index contents when we are developing our code. If you run it with no command-line arguments, it will print a usage message that tells you what parameters it takes.
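For example (assuming InspectIndex.py and the index directory are in the locations used elsewhere in this FAQ):
python InspectIndex.py
python InspectIndex.py -index INPUT_DIR/index-cw09 -list-attribute 2 body-string
The first command prints the usage message; the second lists the body-string attribute of the document with internal id 2.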
Test Cases and Grading
Question: Is there a script to help me download the development ("Training") test cases for HW1?
Answer: The TAs provided an example script for Linux users.
Note that it isn't necessary for you to download all of the development test cases. You can use the HW Testing Service if you prefer. However, it may be faster to compare your results to local copies of the gold-standard results, especially when the web servers are busy.
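If you would rather write your own small downloader, a minimal sketch along the same lines is shown below; the base URL and file names are placeholders, so substitute the values given on the HW1 Testing page.
import os
import urllib.request

BASE_URL = 'https://example.org/hw1/training'   # placeholder; use the URL from the HW1 Testing page
FILES = ['example.qry', 'example.teIn']         # placeholders; list the files you want

os.makedirs('training', exist_ok=True)
for name in FILES:
    urllib.request.urlretrieve(BASE_URL + '/' + name, os.path.join('training', name))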
Question: I'm having trouble getting trec_eval to accept my result file.
Answer: See a small working example. Download it and try it. If it works, the problem is your file, not the trec_eval program.
Question: How do I read the trec_eval output?
Answer: See this simplified copy of the Evaluation lecture slides, or see this abbreviated trec_eval manual.
Question: I'm not getting the results that I expected for #AND queries.
Answer: Make sure that you are using tokenizeString properly. It is essential that your system discards stopwords and does stemming; otherwise your query terms will not match terms in the index. Also, ensure that your final output is sorted correctly and that ties in score are broken by external document id in ascending order.
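For example, one way to get the required ordering is shown in the minimal sketch below, which assumes each result is a (score, external document id) pair; the ids are illustrative.
# Illustrative results: (score, external document id) pairs.
results = [(2.0, 'clueweb09-en0000-01-00042'),
           (3.5, 'clueweb09-en0000-23-11111'),
           (2.0, 'clueweb09-en0000-00-00013')]

# Sort by descending score; ties are broken by ascending external document id.
ranked = sorted(results, key=lambda r: (-r[0], r[1]))

for rank, (score, ext_id) in enumerate(ranked, start=1):   # ranks start at 1
    print(rank, ext_id, score)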
Question: Is there a way to automate the submission of .teIn files to the trec_eval Service on the HW1 Testing page?
Answer: Yes. The TAs have provided an example Python script.
Question: How does the automatic grading determine grades?
Answer: Your trec_eval output is compared to our trec_eval output. If the rankings match exactly, the score is 100%. If the rankings do not match exactly, your MAP score is compared to our MAP score. Every 1% difference in MAP causes a 2% deduction in your score. A 1% difference in MAP might be caused by small differences in how document scores are calculated (e.g., roundoff errors), so if you are getting 98%, that's good enough (a 2% difference on 1 test isn't going to affect your grade). Larger differences usually indicate errors in how your query was constructed or how document scores are calculated.
Many of the problems that people bring to office hours are caused by software that writes incorrect trec_eval input files. The first document must be ranked 1, not 0. There cannot be random extra characters (e.g., commas) in the trec_eval input. Ties in score must be broken by a sort on the external document id. Be very careful about your trec_eval input.
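As a concrete illustration of those rules, here is a minimal sketch that writes one query's results, assuming the usual six-column trec_eval input format (query id, "Q0", external document id, rank, score, run id); the query id, file name, and run id are placeholders.
query_id = '10'                                # placeholder query id
ranked = [('clueweb09-en0000-00-00013', 3.5),
          ('clueweb09-en0000-01-00042', 2.0)]  # (external id, score), already sorted with ties broken

with open('sample.teIn', 'w') as out:          # placeholder file name
    for rank, (ext_id, score) in enumerate(ranked, start=1):   # the first rank is 1, not 0
        # Space-separated fields only; no commas or other stray characters.
        out.write('{} Q0 {} {} {} run-1\n'.format(query_id, ext_id, rank, score))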
Question: Are the tests on the HW1 Testing page the same tests that will be used for grading? If the tests are different, but my software passes the tests on the HW1 Testing page, can I be certain that it will also pass the tests used for grading?
Answer: When we do grading, your software will be tested on queries that you have not seen. Your software may behave differently on those queries. We will not try to break your software with strange cases, but we may test it on query structures that are slightly different from what was provided for testing (e.g., different nested query operators, more nested query operators). We strongly encourage you to do your own testing of your software, and to also test it with queries that are different from what we provided.
Question: What is a reasonable running time?
Answer: Your program will not be graded based on its speed unless it is unusually slow or fast. The Testing page shows the speed of Jamie's system on each test, as measured by the homework testing system.
Question: Is there a way to see what the original documents looked like before indexing?
Answer: You can see a small sample of the collection (20 documents). Each document begins with a <DOC> tag and ends with a </DOC> tag.
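If you want to iterate over the documents in that sample programmatically, a minimal sketch is shown below; it assumes the sample is saved as a plain text file (the file name is a placeholder) and that each <DOC> and </DOC> tag appears on its own line.
docs = []
current = None
with open('sample_docs.txt') as f:      # placeholder file name
    for line in f:
        if line.strip() == '<DOC>':     # start of a document
            current = []
        elif line.strip() == '</DOC>':  # end of a document
            docs.append('\n'.join(current))
            current = None
        elif current is not None:
            current.append(line.rstrip('\n'))
print(len(docs), 'documents')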
Question: Is there a way to view the document text?
Answer: There are two ways.
- Use InspectIndex.py to view the "-string" version of a field. For example, to view the body field of the document with internal document id 2, use the following command.
python InspectIndex.py -index INPUT_DIR/index-cw09 -list-attribute 2 body-string
This shows the raw text of the body field. Note that inverted lists and term vectors store the text after case conversion, stopword removal, stemming, and other lexical processing, so the "-string" version of a field is not an exact match to the contents of an inverted list or term vector.
- Use the -list-termvectors feature of InspectIndex to see a document's terms after case conversion, stopword removal, stemming, and other lexical processing. Depending on what you are trying to do, you may need to sort the term occurrences by position (see the sketch below).
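For the second approach, if you collect the term vector entries as (term, position) pairs (the exact representation depends on how you read them), a minimal sketch that restores position order is:
# Hypothetical term vector entries as (term, position) pairs.
entries = [('appl', 7), ('run', 2), ('dog', 5)]

# Reorder the occurrences by position to recover the original term order.
in_position_order = sorted(entries, key=lambda e: e[1])
print([term for term, position in in_position_order])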
Question: Students are required to submit HW-Exp-xxx files that show how the experiments were conducted. What are common errors that prevent the homework testing software from being able to reproduce student work?
Answer:
- Missing, broken, or corrupt .param files.
- The filetype should be .param, but is actually .param.txt or .param.rtf. This problem also applies to .qry and other input files.
- .param files that reference incorrect filenames or files not contained in the submission.
Points are deducted if your work cannot be reproduced automatically.
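A quick way to catch the double-extension problem before you submit is to scan your submission directory for file names that end in an extra extension; a minimal sketch (the directory name is a placeholder) is shown below.
import os

for name in os.listdir('submission'):   # placeholder directory name
    # Flag names such as something.param.txt or something.qry.rtf.
    if name.endswith(('.param.txt', '.param.rtf', '.qry.txt', '.qry.rtf')):
        print('Suspicious file name:', name)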
Efficiency
Question: My software is slow, but I don't know why.
Answer: Try using Python's profiling capabilities to see where your software spends the most time. The last line of the QryEval.py file that was provided to you is shown below.
main()
Replace it with the following code.
# Profile the program and save the timing data to QryEval.profile.
import cProfile
cProfile.run('main()', 'QryEval.profile')

# Print the method calls sorted by cumulative time.
import pstats
from pstats import SortKey
p = pstats.Stats('QryEval.profile')
p.strip_dirs().sort_stats(SortKey.CUMULATIVE).print_stats()
The result is a list of Python method calls sorted by the amount of cumulative time they take. See the Python pstats documentation for more details about profiling Python software.
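If you prefer not to edit QryEval.py, you can collect a similar profile from the command line with the same modules; replace <your usual arguments> with however you normally run QryEval.py.
python -m cProfile -o QryEval.profile QryEval.py <your usual arguments>
python -m pstats QryEval.profile
The second command opens an interactive browser for the saved profile; typing "sort cumulative" and then "stats 20" prints the 20 methods with the largest cumulative time.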
Miscellaneous
Question: I am having problems running trec_eval on my Mac.
Answer: Macs do not trust executables downloaded from the web. Their default behavior is to attach a quarantine tag to the executable, which prevents it from running. You can add the executable to a whitelist as follows.
xattr -r -d com.apple.quarantine [your_path]/trec_eval
Another option is to compile your own executable, using source code from NIST's official trec_eval repo.
Question: What are the information needs that correspond to these queries?
Answer: NIST developed information needs that were of interest to its assessors and covered by the ClueWeb09 documents.
Question: What are the relevance judgments for these queries?
Answer: cw09a.adhoc.1-200.qrel.indexed.
Question: I can't download the sample ClueWeb09 documents or the Lucene stopwords list.
Answer: Those files have .txt extensions. Some students have reported that some Mac browsers discard the .txt file extension.
If the FAQ hasn't answered your question, please read the Homework Testing FAQ or search the Piazza forum to see if someone has already answered your question before you ask it.