HW2: Query Rewriting with PRF
Ranking Pipeline
Pseudo Relevance Feedback
Efficiency
File Formats
Implementation Order
Testing
Ranking Pipeline
Your HW1 system has a task-oriented ranking pipeline that supports four types of tasks (Ranker, Output, etc.). This assignment adds a new type of task: Rewriter. The Rewriter class has the same high-level design as the other four task types.
- The Rewriter class is initialized with a dict of parameters when it is created (i.e., task = Rewriter(task_parameters)), and
- The Rewriter class has an execute method that accepts a batch object and returns a batch object.
When the Rewriter type is "prf", it uses pseudo relevance feedback to rewrite the query. A later homework may add another type of query rewriter.
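To make that design concrete, here is a minimal sketch of a Rewriter task. It is not the required implementation: the parameter key and the module path for RewriteWithPrf (the class named under Implementation Order below) are assumptions, and the PRF logic itself is left to RewriteWithPrf.

from RewriteWithPrf import RewriteWithPrf   # assumed module name; adjust to your project layout

class Rewriter:
    """Pipeline task that rewrites each query in a batch."""

    def __init__(self, task_parameters):
        # task_parameters is the dict of Rewriter parameters (see Parameters below).
        self._parameters = task_parameters
        # Dispatch on the rewriter type; "prf" is the only type in HW2.
        if task_parameters.get('rewriter') == 'prf':
            self._impl = RewriteWithPrf(task_parameters)
        else:
            raise ValueError('Unsupported rewriter type')

    def execute(self, batch):
        # Accept a batch object and return a batch object in which each
        # query's qstring has been replaced by the rewritten query.
        return self._impl.execute(batch)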
After the query is rewritten, another Ranker is used to produce a new ranking. The ranking pipelines for this assignment have the following structure.
Ranker → Rewriter → Ranker → Output
The Ranker tasks are your HW1 system, which obtains initial rankings from BM25, RankedBoolean, or an .inRank file (see below). Most people will not need to make any changes to their HW1 system. The two Ranker tasks can be configured identically or differently depending on the experiment. Document your choices in your report.
Pseudo Relevance Feedback
Pseudo relevance feedback algorithms expect the incoming batch to include a Ranking field that contains an ordinary document ranking. They return a batch in which each query's qstring field is rewritten with a new query. The Ranking field is unchanged. (A later task may overwrite the Ranking field, but that is unrelated to the Rewriter.)
This assignment implements the Okapi and query likelihood (Indri, RM3) pseudo relevance feedback algorithms discussed in class.
Conceptually, pseudo relevance feedback is relatively simple.
foreach doc_i in initial ranking:
    foreach term_j in doc_i:
        update the score of term_j
use the top n terms to create a learned_query
write the learned_query to the .qryOut file
use the learned_query to create an expanded query
write the expanded query to the batch
Use the TermVector index (described below) to find the terms that are in a particular document.
When the query likelihood PRF algorithm is used with smoothing, the obvious implementation is a little inefficient. Efficiency can be improved with a clever design, but smoothing does not greatly improve PRF accuracy, so your PRF implementation will not use smoothing (i.e., lambda = 0, mu = 0; see Lecture 5).
The TermVector Index
Pseudo relevance feedback algorithms select terms from the vocabulary that occurs in the top-ranked retrieved documents. Thus, your software must have rapid access to the terms in a document. The TermVector index provides access to this information. QryEval's Idx.getTermVector method returns an object that provides access to Lucene's TermVector index.
The TermVector index provides more capabilities than you need for HW2. Really, all you care about is stems, a list of the terms that occur in the document, and stemsLength(), the length of stems.
Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).
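For example, here is a sketch of gathering the terms of one document through the TermVector index. It assumes stems is exposed as a list of strings, as described above, and that docid and field come from your initial ranking and parameters; Idx is QryEval's index interface, imported however your HW1 code does it.

def document_terms(docid, field):
    # Fetch the TermVector for this document field.
    term_vector = Idx.getTermVector(docid, field)
    if term_vector.stemsLength() == 0:
        return []                 # empty TermVector: the field does not exist
    # stems lists the terms that occur in the document.
    return list(term_vector.stems)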
Discard Terms
The index contains terms that cause problems for the query parser, other parts of your system, or the grading software. Your query expansion software must ignore any candidate expansion term that contains a period (e.g., "clipart.com"), a comma (e.g., "2,000"), or a non-ASCII character.
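A small filter along these lines (the function name is illustrative) keeps the check in one place:

def keep_expansion_term(term):
    # Discard candidate terms that contain a period, a comma,
    # or any non-ASCII character.
    return '.' not in term and ',' not in term and term.isascii()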
Parameters
Your software must support all of the parameters used in previous homework, as well as the new Rewriter parameters described below.
- rewriter: The type of query rewriter. In this assignment, the value is "prf".
- prf:algorithm: The value is "okapi" or "rm3".
- prf:numDocs: The maximum number of documents to use for PRF.
- prf:numTerms: The maximum number of terms to include in the learned query.
- prf:expansionField: The document field to use for learning new queries. If this field is not specified, default to the body field.
- prf:expansionQueryFile: Write the learned query to this file in the format described below.
- prf:rm3:origWeight: Use RM3 to combine the original query and the learned query using this weight for the original query.
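For illustration, a hypothetical fragment of a parameter file in the key=value style used in earlier homeworks; the values are made up, and the parameter names listed above (and the HW2 testing page) are authoritative.

rewriter=prf
prf:algorithm=rm3
prf:numDocs=10
prf:numTerms=25
prf:expansionField=body
prf:expansionQueryFile=HW2-train-0.qryOut
prf:rm3:origWeight=0.5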
Efficiency
Your system will be faster if it caches constant values, for example, the number of documents in the index and RSJ values for candidate terms. You do not need to spend massive effort on efficiency, but try to avoid repeatedly looking up the same value from the index. Most index lookups require Python to talk to Java, which is slow.
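A lightweight way to do this is to memoize values that cannot change during a run. Below is a sketch; the Idx calls are placeholders for whatever lookups your system actually uses, and the RSJ formula shown is the usual idf-style weight, so check the lecture notes for the exact variant you need.

import functools
import math

@functools.lru_cache(maxsize=None)
def corpus_size():
    # The document count never changes during a run, so the
    # Python-to-Java lookup happens only once.
    return Idx.getNumDocs()                  # placeholder call

_rsj_cache = {}

def rsj_weight(field, term):
    # Cache the RSJ weight so each (field, term) pair is looked up once.
    key = (field, term)
    if key not in _rsj_cache:
        df = Idx.getDocFreq(field, term)     # placeholder call
        n = corpus_size()
        _rsj_cache[key] = math.log((n - df + 0.5) / (df + 0.5))
    return _rsj_cache[key]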
File Formats
This assignment uses two new file formats: .inRank files and .qryOut files. The new file formats are described below.
.inRank files
Rankings may be computed dynamically, as in HW1, or read from an .inRank file. QryEval already has the ability to read rankings from .inRank files, although it was not used in HW1. No additional work is required.
.inRank files allow us to test your pseudo relevance feedback algorithm independently of your BM25 implementation, just in case your BM25 implementation is not perfect. They also speed up your experiments: you can produce initial rankings once and reuse them many times, instead of recomputing the same initial rankings for each experiment.
The .teIn and .inRank file formats are identical; the .teIn file produced by a BM25 experiment can be the .inRank input for a PRF experiment.
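For reference, a made-up fragment in that shared format, assuming the trec_eval-style columns from HW1 (query id, the literal Q0, external document id, rank, score, run name); the document ids and scores below are invented.

51 Q0 doc-0001 1 14.213 run-1
51 Q0 doc-0042 2 13.870 run-1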
.qryOut files
If the Rewriter has a prf:expansionQueryFile parameter, your software must write a copy of each learned query to the specified file. Each query is written to a single line in the following format.
qid1: learned-query1
qid2: learned-query2
: :
For example:
51: #sum (obama family white tree politics ...)
92: #sum (french lick indiana ...)
: :
The file should contain only the learned queries, one per line.
The expansion terms may be in lower, mixed, or upper case. Unnecessary tab and space characters are ignored. Sort terms from least important (start of sequence) to most important (end of sequence); break ties lexically (e.g., "cat" occurs lexically before "dog").
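To tie those requirements together, here is a sketch of turning candidate term scores into a learned query and a .qryOut line. The function name is illustrative, the #sum wrapper follows the example above, and the term scoring itself is left to your PRF algorithm.

def write_learned_query(qid, term_scores, num_terms, qry_out):
    # term_scores: dict mapping candidate term -> PRF score.
    # Keep the num_terms highest-scoring terms ...
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:num_terms]
    # ... then order them least important first, breaking ties lexically.
    top.sort(key=lambda kv: (kv[1], kv[0]))
    learned_query = '#sum (' + ' '.join(term for term, _ in top) + ')'
    qry_out.write(f'{qid}: {learned_query}\n')
    return learned_query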
Implementation Order
HW2 has several moving parts, but each is fairly simple. We recommend the following development path.
Implement a Rewriter class (Rewriter.py) and a RewriteWithPrf class.
Start with Okapi and the simplest PRF configuration. Focus on test cases that i) use .inRank files, and ii) use just a few documents and terms. Get the .qryOut file working to simplify debugging.
Once some simple test cases work, shift to test cases that use many documents and many terms. These are slower (longer debugging cycles), but they help you diagnose small errors in scoring.
When all of the .inRank + PRF configurations work properly, do the test cases that use your initial ranking. If your HW1 system worked properly, there shouldn't be much work here.
The grade consists of two components: i) a grade for correct expansion terms in your .qryOut file, and ii) a grade for a correct reranking. Focus first on the expansion terms. After they are correct, focus on the reranking grade. The reranking grade is affected by the initial ranking, the learned query, the expanded query, and reranking by the new scores.
Finally, do query likelihood PRF. Follow the same implementation sequence that you used for Okapi. The algorithms are similar, so software development should go quickly. The query likelihood queries may run more slowly than the Okapi queries though, so be sure to leave enough time.
Testing
You have the same options for testing your software that you had for HW1.
The web services are accessed via the HW2 testing web page.
If you choose to do most of your testing locally on your own machine, use the following files: zip or tgz.
FAQ
If you have questions not answered here, see the HW2 FAQ and the Homework Testing FAQ.