Search Engines:
11-442 / 11-642
 
 

HW3: Learning to Rank Design Guide

Software Architecture
Efficiency
File Formats
Missing or Empty Fields
Normalizing Feature Values
Using the Toolkits
Implementation Order

 

Software Architecture

Each stage of the ranking pipeline processes a set of queries, which may seem unintuitive. For example, if the .qry file contains three queries, the initial ranker (e.g., BM25) ranks all of them and then passes the results to the learning-to-rank stage to rerank all of them. This architecture is a little simpler to build and faster to run than the more obvious query-at-a-time architecture.

Your LTR reranking software should be structured more-or-less as follows:

  # generate training data
  while a training query q is available:
    use QryParser.tokenizeQuery to stop & stem the query
    foreach document d in the relevance judgments for training query q:
      use Idx.getTermVector to get d's term vector
      generate a feature vector for <q, d>

    if toolkit is SVMrank:
      normalize the feature values for query q to [0..1] 
    write the feature vectors to file

  # train a model
  call a toolkit to train a model

  # generate testing data for top n documents in initial BM25 ranking
  while (a test query q is available):
    use QryParser.tokenizeQuery to stop & stem the query
    foreach document d in the top n of initial ranking for q:
      use Idx.getTermVector to get d's term vector
      generate a feature vector for <q, d>

    if toolkit is SVMrank:
      normalize the feature values for query q to [0..1] 
    write the feature vectors to file

  # re-rank test data (all queries)
  call the toolkit to produce scores for the test data
  read the new scores and use them to re-rank the initial rankings
  write the re-ranked results in trec_eval format
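
The training loop above gets its <q, d> pairs from the relevance judgments file. A minimal sketch of reading that file is shown below; it assumes the standard TREC qrels format (query id, an unused column, external document id, relevance grade), and the function name and path are placeholders.

  # Read relevance judgments, assuming the standard TREC qrels format:
  #   <query id>  <unused>  <external document id>  <relevance grade>
  from collections import defaultdict

  def readQrels(path):
      judgments = defaultdict(list)        # qid -> [(externalId, relevance), ...]
      with open(path) as qrels:
          for line in qrels:
              qid, _, externalId, relevance = line.split()
              judgments[qid].append((externalId, int(relevance)))
      return judgments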

Feature vectors are generated more-or-less as follows.

  # generate feature vector
  create an empty feature vector
  use Idx.getAttribute to get the PageRank, spam, and Wikipedia features for d from the index
  fetch the term vector(s) for d
  calculate other features for <q, d>
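
The sketch below is one hypothetical way to organize that pseudocode in Python: features are collected in a dict keyed by feature id. Idx.getAttribute, Idx.getTermVector, and stemsLength come from this guide; the attribute name 'PageRank', the example feature ids, and the featureBM25 helper (whose pseudocode and a fuller sketch appear below) are illustrative assumptions.

  # A hypothetical sketch of assembling one feature vector for <q, d>.
  # The attribute name and feature ids are examples; use the names and
  # numbering defined in the HW3 instructions.
  def generateFeatureVector(queryStems, docid):
      features = {}

      # index attributes (PageRank, spam, Wikipedia)
      pageRank = Idx.getAttribute('PageRank', docid)     # assumed attribute name
      if pageRank is not None:                           # some documents lack it
          features[4] = float(pageRank)                  # example feature id

      # field-specific retrieval model features
      title = Idx.getTermVector(docid, 'title')
      if title.stemsLength() > 0:                        # skip missing or empty fields
          features[8] = featureBM25(queryStems, title)   # e.g., f8: BM25 for <q, d_title>

      # ... other attributes, fields, and features ...
      return features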

The main advantage of this implementation is its efficiency and simplicity. Its main disadvantage is that you must reimplement BM25 to work from a term vector data structure. Assume BOW queries (no query operators) and that you are just reranking the top n documents provided by the initial ranker, which simplifies implementation considerably. Pseudocode for a BM25 implementation that uses term vectors is shown below.

  # tokenize the query before calling featureBM25
  queryStems = QryParser.tokenizeQuery (query)

  # generate a BM25 feature
  def featureBM25 (qid, queryStems, termVector):
      score = 0
      for stem in queryStems:
          if stem in termVector:
              score += BM25 term score for stem

      return score

Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).
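
The sketch below fleshes out the featureBM25 pseudocode (the unused qid argument is omitted). The TermVector accessors indexOfStem, stemFreq, and stemDf, and the Idx methods used for corpus statistics (getNumDocs, getSumOfFieldLengths, getDocCount) are assumptions about the API; check your skeleton code for the real names. k1 and b should come from your parameter file; the defaults below are just examples.

  # A sketch of a BM25 feature computed from a term vector.
  import math

  def featureBM25(queryStems, termVector, field='title', k1=1.2, b=0.75):
      N = Idx.getNumDocs()                                      # assumed API
      avgLen = Idx.getSumOfFieldLengths(field) / Idx.getDocCount(field)
      docLen = termVector.positionsLength()
      score = 0.0
      for stem in queryStems:
          i = termVector.indexOfStem(stem)                      # assumed API
          if i is None or i < 0:                                # stem not in this field
              continue
          tf = termVector.stemFreq(i)
          df = termVector.stemDf(i)
          rsj = max(0.0, math.log((N - df + 0.5) / (df + 0.5)))
          tfWeight = tf / (tf + k1 * ((1 - b) + b * docLen / avgLen))
          score += rsj * tfWeight
      return score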

 

Efficiency

Your HW1 system probably cached constant values (for example, the average title field length) and values that are constant for a particular query (for example, the RSJ score for "apple" in the "body" field). It was natural for the QrySopScore operator to cache those values for BM25.

Caching constant and query-constant values is even more important for LTR rerankers. You have freedom in how you implement this capability in your LTR reranker. One approach is to cache constant and query-constant values when the feature generator is initialized (or reinitialized for a particular query).

Caching can become a distraction and a source of frustrating errors. Think about caching first, and aim for simplicity. The training queries and test queries define the vocabulary that benefits from caching. The active features define the values that might benefit from caching. For example, only cache the RSJ value for "apple.title" if you are using feature f8. Think of retrieval model scoring as divided into two parts: i) document-independent (cache), and ii) document-specific (don't cache).

Pro tip: Think about caching right at the start when you are building your first retrieval model feature. Your architecture will be simpler and cleaner. If you wait until you have a working (but slow) system, you will do twice as much work, and you are more likely to have little bugs that waste your time.
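
For example, the BM25 sketch earlier in this guide recomputes corpus statistics on every call. The hypothetical variant below caches constant values at construction time and query-constant RSJ weights per query, leaving only document-specific work in the scoring loop. The Idx and TermVector accessors are the same assumptions as in that earlier sketch.

  # A hypothetical sketch of the cached / not-cached split for a BM25 feature.
  import math

  class Bm25Feature:

      def __init__(self, field, k1, b):
          self.field, self.k1, self.b = field, k1, b
          # constant values: cache once
          self.N = Idx.getNumDocs()
          self.avgLen = Idx.getSumOfFieldLengths(field) / Idx.getDocCount(field)
          self.rsj = {}

      def beginQuery(self):
          # reinitialize the query-constant cache for each new query
          self.rsj.clear()

      def score(self, queryStems, termVector):
          docLen = termVector.positionsLength()        # document-specific: don't cache
          total = 0.0
          for stem in queryStems:
              i = termVector.indexOfStem(stem)
              if i is None or i < 0:
                  continue
              if stem not in self.rsj:                 # query-constant: cache on first use
                  df = termVector.stemDf(i)
                  self.rsj[stem] = max(0.0, math.log((self.N - df + 0.5) / (df + 0.5)))
              tf = termVector.stemFreq(i)
              total += self.rsj[stem] * tf / (tf + self.k1 * ((1 - self.b) + self.b * docLen / self.avgLen))
          return total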

 

File Formats

Your system will use files to communicate with the machine learning toolkits. It will write a feature vector file to describe <q,d> pairs to the toolkit, and it will read a file of <q,d> scores to rerank documents.

Feature Vector File

Both toolkits use the same feature vector file format. Each line contains the feature vector for one <q, d> pair. Several example records are below.

2 qid:1 1:1 2:1 3:0 4:0.2 5:0 # clueweb09-en0000-48-24794
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # clueweb09-en0011-93-16495
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # clueweb09-en0132-64-99898
0 qid:1 1:0 2:0 3:1 4:0.3 5:0 # clueweb09-en0370-54-24993
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # clueweb09-en2108-01-81090
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # clueweb09-en0760-49-43194
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # clueweb09-en6303-47-69892
0 qid:2 1:0 2:0 3:1 4:0.2 5:0 # clueweb09-en6373-93-83391
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # clueweb09-en2049-70-10797
1 qid:3 1:1 2:1 3:0 4:0.3 5:0 # clueweb09-en1703-12-05495
1 qid:3 1:1 2:0 3:0 4:0.4 5:1 # clueweb09-en8278-44-45799
0 qid:3 1:0 2:1 3:1 4:0.5 5:0 # clueweb09-en9313-84-41193

The first column is the score or target value of the <q, d> pair. In a training file, use the relevance value obtained for this <q, d> pair from the relevance judgments ("qrels") file. In a test file, this value should be 0.

The second column is the query id.

The next n columns are feature_id:feature_value pairs, where n is the number of features. Feature ids must be integers starting at 1. Feature values may be integers or floats. Features must appear in canonical order (1 first, 2 next, etc.). If a feature does not apply to a particular <q, d> pair (e.g., the document does not have an inlink field), SVMrank allows you to omit it (sparse representation), but RankLib does not (complete representation).

A '#' character indicates a comment. The '#' character and everything after it until the end of the line are ignored by the toolkit. However, the grading software uses the comment field: each line must end with the external document id.

The sort key is column 2 (ascending numeric order for the query id portion of the field).
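
A minimal sketch of writing one record in this format is shown below; the function and variable names are placeholders.

  # Write one feature vector line.  features maps feature id -> value.
  # Omitting a missing feature is legal for SVMrank only; for RankLib,
  # every feature id must appear on every line.
  def writeVector(out, target, qid, features, externalId):
      pairs = ' '.join('{}:{}'.format(fid, features[fid]) for fid in sorted(features))
      out.write('{} qid:{} {} # {}\n'.format(target, qid, pairs, externalId))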

New Score File

When your software uses a toolkit to calculate new scores for <q,d> pairs, the toolkit writes the new scores to a file. Re-rank the documents based on these new scores. Higher scores are better.

SVMrank

The file contains one score per line, as shown below. The scores are listed in the same order as the feature vector file, above.

1.48250444
1.70125248
1.15657135
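
Because the file contains only scores, your software must remember the (query id, external document id) order in which it wrote the feature vector file. A minimal sketch, with placeholder variable names, is shown below.

  # Pair SVMrank scores with the <q, d> order used when writing the feature
  # vector file.  writtenPairs is a placeholder for the list of
  # (qid, externalId) pairs recorded while writing that file.
  with open(testingDocumentScores) as f:
      scores = [float(line) for line in f]
  scored = list(zip(writtenPairs, scores))     # [((qid, externalId), score), ...]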

RankLib

Each line of the file contains a query id, document rank, and score, separated by tab characters, as shown below. The scores are listed in the same order as the feature vector file, above.

059	0	1.48250444
059	1	1.70125248
059	2	1.15657135
 :	:	  :    :
102	1	0.992092250788071
 :	:	  :    :
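
A minimal sketch of reading these scores, re-ranking, and writing trec_eval output is shown below. It keeps only the third (score) column, pairs the scores with the (query id, external document id) order used when the feature vector file was written, and assumes the standard trec_eval input format; the variable names and run id are placeholders.

  # Read RankLib scores, re-rank each query, and write trec_eval output.
  from collections import defaultdict

  with open(testingDocumentScores) as f:
      scores = [float(line.split()[2]) for line in f]       # qid, rank, score -> score

  ranking = defaultdict(list)                               # qid -> [(score, externalId), ...]
  for (qid, externalId), score in zip(writtenPairs, scores):
      ranking[qid].append((score, externalId))

  with open(trecEvalOutputPath, 'w') as out:
      for qid in sorted(ranking, key=int):
          docs = sorted(ranking[qid], key=lambda d: d[0], reverse=True)   # higher is better
          for rank, (score, externalId) in enumerate(docs, start=1):
              out.write('{} Q0 {} {} {} run-1\n'.format(qid, externalId, rank, score))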

 

Missing or Empty Fields

When generating a field-specific feature (e.g., f8, the BM25 score for <q, d_title>), your software may encounter documents that do not have that field (the index returns None) or that have an empty field (the field length is 0). Your software should not generate a feature for these documents.

If your software generates a value of 0 for these cases, it will not match the reference software.

 

Normalizing Feature Values

SVMrank

The most convenient software architecture creates a feature vector, writes it to disk, creates the next feature vector, writes it to disk, and so on. Unfortunately, SVMrank is more effective if your features are all normalized to be in the same range, for example [0..1]. Your SVMrank features must be normalized to [0..1].

Your software does not know the range of query likelihood, BM25, PageRank, and other features in advance, so it cannot normalize feature vectors incrementally. Instead, it must wait and do the normalization after all feature vectors for a particular query have been created.

To normalize a feature (e.g., f5) for a specific query, identify the maximum and minimum values for that feature, and then do standard [0..1] normalization. For example, the normalized value for feature f5 (q3, d21) is:

(featureValue_f5 (q3, d21) - minValue_f5 (q3)) / (maxValue_f5 (q3) - minValue_f5 (q3)).

featureValue_f5 (q3, d21): The value of feature f5 for query q3 and document d21.
minValue_f5 (q3): The minimum value of feature f5 across all feature vectors for query q3.
maxValue_f5 (q3): The maximum value of feature f5 across all feature vectors for query q3.

If the min and max are the same value, set the feature value to 0.

Thus, each feature is normalized separately and for each query.

Beware: When a feature does not exist for a document (e.g., the document doesn't have a PageRank score, or an inlink field), you may want to set the feature value to zero. However, be sure that you set it to zero after normalization. If you set the value before normalization, you may be inadvertently creating a new minimum or maximum value, which will produce incorrect results.
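
A minimal sketch of this per-query normalization is shown below; vectors is a placeholder for the list of feature dicts that your software accumulated for one query.

  # Normalize every feature to [0..1] across all feature vectors for one
  # query.  vectors is a list of dicts (feature id -> value), one per
  # document.  Features that are missing for a document stay missing here;
  # set them to 0 (or omit them) only after this step.
  def normalizeQuery(vectors):
      featureIds = set(fid for v in vectors for fid in v)
      for fid in featureIds:
          values = [v[fid] for v in vectors if fid in v]
          lo, hi = min(values), max(values)
          for v in vectors:
              if fid in v:
                  v[fid] = 0.0 if hi == lo else (v[fid] - lo) / (hi - lo)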

RankLib

It is not necessary to normalize features for RankLib. If you want features to be normalized, do not do it yourself. Instead, your software may use a parameter to specify the type of normalization that RankLib should do.
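
For example, the sketch below passes a normalization option through to RankLib at training time. The -norm parameter and its values are taken from the RankLib documentation, not from this assignment; verify them against the version in your LIB_DIR.

  # Let RankLib normalize features itself (verify -norm against your version).
  PyLu.RankLib.main([
      '-train',  <ltr:trainingFeatureVectorsFile>,
      '-ranker', <ltr:RankLib:model>,
      '-norm',   'zscore',                 # or 'sum', 'linear'
      '-save',   <ltr:modelFile>])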

 

Using the Toolkits

This assignment uses two machine learning toolkits that are called in different ways.

  1. SVMrank trains a model with a support vector machine (SVM) and pairwise training. It consists of two C++ applications: svm_rank_learn and svm_rank_classify. Executables for Mac, Linux, Windows, and Windows with Cygwin are provided, or you can compile your own version from the source code. The Linux executable requires a recent version of gcc.

    Call it from your Python software as shown below.

        # Train a model
        import subprocess

        output = subprocess.check_output(
            '<ltr:svmRankLearnPath> -c <ltr:svmRankParamC> <ltr:trainingFeatureVectorsFile> <ltr:modelFile>',
            stderr=subprocess.STDOUT,
            shell=True).decode('UTF-8')

        # Generate new ranking scores
        output = subprocess.check_output(
            '<ltr:svmClassifyPath> <ltr:testingFeatureVectorsFile> <ltr:modelFile> <ltr:testingDocumentScores>',
            stderr=subprocess.STDOUT,
            shell=True).decode('UTF-8')
      
  2. RankLib provides several pairwise and listwise algorithms for training models. The LIB_DIR created for HW1 contains a jar file that can be invoked as an external application (as above) or called directly (more efficient, fewer problems).

    Call it from your Python software as shown below.

        # Train a model
        PyLu.RankLib.main([
            '-train',   <ltr:trainingFeatureVectorsFile>,
            '-ranker',  <ltr:RankLib:model>,
            '-save',    <ltr:modelFile>,
            '-metric2t', <ltr:RankLib:metric2t>]) # optional parameter
    
        # Generate new ranking scores
        PyLu.RankLib.main([
            '-rank',  <ltr:testingFeatureVectorsFile>,
            '-load',  <ltr:modelFile>,
            '-score', <ltr:testingDocumentScores>])
      

    See the RankLib documentation for information about other parameters, including the parameters needed to produce scores for <q,d> pairs.

 

Implementation Order

HW3 has many moving pieces, but each of them is fairly simple. We recommend the following development path.

 

FAQ

If you have questions not answered here, see the HW3 FAQ and the Homework Testing FAQ.


Copyright 2024, Carnegie Mellon University.
Updated on November 06, 2024

Jamie Callan