Learning to Rank: FAQ and Common Pitfalls

Search Engines:
11-442 / 11-642

HW3 Frequently Asked Questions and Common Pitfalls

Common Pitfall #1: When your software accesses the TermVector for a particular <docid, field> combination, an exception is generated. Usually this happens because your software tried to read information from an empty TermVector. Empty TermVectors get allocated when your software requests the contents of a field that a document does not have; for example, a document that does not have an inlink field. Use the TermVector's positionsLength or stemsLength method to determine whether it is empty before you try to access it.
Common Pitfall #2: When normalizing feature scores, all documents for a query may have the same score for a feature (e.g., none of them have an inlink field). The minimum and maximum are then equal, so be careful that normalization does not produce Infinity or NaN feature values. In this case, you can set the scores to 0.
Common Pitfall #3: When a feature does not exist for a document (i.e. the document doesn't have a PageRank score, or the document doesn't have an inlink field), one method to deal with it is to set the feature to zero. However, you should make sure that you are setting the feature to zero after normalization. If you set the value before normalization, you may be inadvertently creating a new minimum or maximum value!
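Pitfalls #2 and #3 can be handled together in one normalization routine. The following is a minimal Python sketch, not the reference implementation. It assumes min-max normalization to [0, 1] and uses None to mark a feature that does not exist for a document; the function name normalize_feature and the None convention are illustrative assumptions.

```python
from typing import List, Optional


def normalize_feature(values: List[Optional[float]]) -> List[float]:
    """Min-max normalize one feature over the documents of a single query.

    None marks a document for which the feature does not exist
    (an assumption made for this sketch).
    """
    # Compute min and max over the values that actually exist, so that
    # missing values cannot become a new minimum or maximum (Pitfall #3).
    present = [v for v in values if v is not None]

    # No document has the feature: every score becomes 0.
    if not present:
        return [0.0] * len(values)

    lo, hi = min(present), max(present)

    # All documents share the same value: avoid dividing by zero,
    # which would produce Infinity or NaN (Pitfall #2).
    if hi == lo:
        return [0.0] * len(values)

    # Missing features are set to 0 only now, after normalization.
    return [0.0 if v is None else (v - lo) / (hi - lo) for v in values]
```

For example, normalize_feature([2.0, None, 4.0, 3.0]) gives [0.0, 0.0, 1.0, 0.5]. If the missing value had been replaced by 0 before normalization, 0 would have become the new minimum and the other scores would have shifted to 0.5, 1.0, and 0.75.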
Common Pitfall #4: You can get the field length from Idx.getFieldLength() or termVector.positionsLength. Usually those values are identical, but not always. Idx.getFieldLength() is the actual length of the field. termVector.positionsLength is a reconstructed value based on the maximum term location in the document. If the field ends with stopwords, termVector.positionsLength does not know about those stopwords, so it is a little shorter than Idx.getFieldLength(). Usually this difference has no effect on the feature value, but it can if the field is short and ends with stopwords. The reference software uses Idx.getFieldLength().
Common Pitfall #5: Make sure that you flush and close data files before trying to pass them to the LTR toolkits. Python does not always write things to disk immediately. You may think the file is complete on disk, but Python hasn't written all of it out yet, so the LTR toolkit sees only a partial file.
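One way to avoid the partial-file problem in Pitfall #5 is to write the file inside a with block, so that it is flushed and closed before the toolkit is started. This is a minimal sketch; the function names and the example feature line are illustrative assumptions, and the toolkit command is passed in by the caller rather than fixed here.

```python
import os
import subprocess
from typing import List, Sequence


def write_feature_file(path: str, lines: Sequence[str]) -> None:
    """Write feature vectors to disk and guarantee the file is complete."""
    with open(path, "w") as f:
        for line in lines:
            f.write(line + "\n")
        # Push Python's buffer to the OS, then ask the OS to write to disk.
        f.flush()
        os.fsync(f.fileno())
    # Leaving the with block closes the file.


def run_toolkit(command: List[str]) -> None:
    """Run an LTR toolkit; call this only after write_feature_file returns."""
    subprocess.run(command, check=True)
```

The important point is the order: the toolkit is started only after write_feature_file has returned, at which point the file has been closed. A call such as write_feature_file("train.txt", ["2 qid:1 1:0.5 2:0.0 # doc1"]) followed by run_toolkit(...) with your toolkit's command line follows that order.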
Common Pitfall #6: When no optimization metric is specified for RankLib, it defaults to ERR@10. However, the reference implementation defaults to MAP. Usually this difference won't have any effect on your results, but it does affect some training cases. If the .param file does not specify a metric2t for RankLib, default to MAP.
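The default in Pitfall #6 can be applied when the RankLib arguments are assembled. This sketch assumes the .param file has already been parsed into a dictionary; the key name "metric2t" and the function name are hypothetical, so substitute the parameter name your .param files actually use. The -metric2t option is RankLib's command-line flag for the training metric.

```python
from typing import Dict, List


def ranklib_metric_args(params: Dict[str, str], key: str = "metric2t") -> List[str]:
    """Return the RankLib training-metric arguments, defaulting to MAP.

    `key` is a hypothetical parameter name used only for illustration.
    """
    # A missing or blank value falls back to MAP instead of letting
    # RankLib apply its own default of ERR@10.
    metric = params.get(key, "").strip() or "MAP"
    return ["-metric2t", metric]
```

Passing the metric explicitly every time means that RankLib's own default is never used.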
Running SVM^rank on a Mac: You need to trust the svm_rank_classify and svm_rank_learn executables and make them executable with chmod a=x. To trust an executable, manually click Open and select Trust.

If the FAQ hasn't answered your question, please search the Piazza forum to see whether someone has already answered it before you ask.

Jamie Callan