Software Architecture
Efficiency
File Formats
Missing or Empty Fields
Normalizing Feature Values
Using the Toolkits
Implementation Order
Each stage of the ranking pipeline processes a set of queries, which may seem unintuitive. For example, if the .qry file contains three queries, the initial ranker (e.g., BM25) ranks all of them and then passes the results to the learning-to-rank stage to rerank all of them. This architecture is a little simpler to build and faster to run than the more obvious query-at-a-time architecture.
Your LTR reranking software should be structured more-or-less as follows:
# generate training data
while a training query q is available:
    use QryParser.tokenizeString to stop & stem the query
    foreach document d in the relevance judgements for training query q:
        use Idx.getTermVector to get d's term vector
        generate a feature vector for <q, d>
    if toolkit is SVMrank:
        normalize the feature values for query q to [0..1]
    write the feature vectors to file

# train a model
call a toolkit to train a model

# generate testing data for top n documents in initial BM25 ranking
while (a test query q is available):
    use QryParser.tokenizeQuery to stop & stem the query
    foreach document d in the top n of initial ranking for q:
        use Idx.getTermVector to get d's term vector
        generate a feature vector for <q, d>
    if toolkit is SVMrank:
        normalize the feature values for query q to [0..1]
    write the feature vectors to file

# re-rank test data (all queries)
call the toolkit to produce scores for the test data
read the new scores and use them to re-rank the initial rankings
write the re-ranked results in trec_eval format
Feature vectors are generated more-or-less as follows.
# generate feature vector
create an empty feature vector
use Idx.getAttribute to get the PageRank, spam, and Wikipedia features for d from the index
fetch the term vector(s) for d
calculate other features for <q, d>
The main advantages of this implementation are its efficiency and simplicity. Its main disadvantage is that you must reimplement BM25 to work from a term vector data structure. Assume BOW queries (no query operators) and that you are just reranking the top n documents provided by the initial ranker; this simplifies the implementation considerably. Pseudo code for a BM25 implementation that uses term vectors is shown below.
# tokenize the query before calling featureBM25
queryStems = QryParser.tokenizeQuery (query)

# generate a BM25 feature
def featureBM25 (qid, queryStems, termVector):
    score = 0
    for stem in queryStems:
        if stem in termVector:
            score += BM25 term score for stem
    return score
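For concreteness, here is a minimal Python sketch of this idea. It is a sketch under assumptions, not the required implementation: stemString, stemFreq, and stemDf are hypothetical TermVector accessors, and the document length, average field length, collection size N, k1, and b are assumed to be supplied by the caller; substitute whatever your TermVector class and index actually provide.

import math

def featureBM25(queryStems, termVector, docLen, avgDocLen, N, k1=1.2, b=0.75):
    # Build a stem -> index map once so each query-stem lookup is O(1).
    stemIndex = {termVector.stemString(i): i
                 for i in range(termVector.stemsLength())}
    score = 0.0
    for stem in queryStems:
        i = stemIndex.get(stem)
        if i is None:
            continue                          # stem does not occur in this field
        tf = termVector.stemFreq(i)           # term frequency in this field
        df = termVector.stemDf(i)             # document frequency (assumed accessor)
        rsj = max(0.0, math.log((N - df + 0.5) / (df + 0.5)))
        tfWeight = tf / (tf + k1 * (1.0 - b + b * docLen / avgDocLen))
        score += rsj * tfWeight
    return score

Building the stem-to-index map once per term vector keeps the per-stem work constant, which matters when you score the top n documents for every query.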
Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).
Your HW1 system probably cached constant values (for example, the average title field length) and values that are constant for a particular query (for example, the RSJ score for "apple" in the "body" field). It was natural for the QrySopScore operator to cache those values for BM25.
Caching constant and query-constant values is even more important for LTR rerankers. You have freedom in how you implement this capability in your LTR reranker. One approach is to cache constant and query-constant values when the feature generator is initialized (or reinitialized for a particular query).
Caching can become a distraction and a source of frustrating errors. Think about caching first, and aim for simplicity. The training queries and test queries define the vocabulary that benefits from caching. The active features define the values that might benefit from caching. For example, only cache the RSJ value for "apple.title" if you are using feature f8. Think of retrieval model scoring as divided into two parts: i) document-independent (cache), and ii) document-specific (don't cache).
Pro tip: Think about caching right at the start when you are building your first retrieval model feature. Your architecture will be simpler and cleaner. If you wait until you have a working (but slow) system, you will do twice as much work, and you are more likely to have little bugs that waste your time.
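As one illustration of that split, a feature generator might be organized along the lines of the sketch below. This is only a sketch: Idx.getDocCount, Idx.getSumOfFieldLengths, and Idx.getDocFreq are assumed names for index statistics (only Idx.getTermVector and Idx.getAttribute are mentioned above), and the field list is illustrative; map them onto whatever your index API provides.

import math

class FeatureGenerator:

    FIELDS = ('body', 'title', 'inlink')      # illustrative field list

    def __init__(self):
        # Document-independent values: compute once and cache for the whole run.
        self.avgLen = {f: Idx.getSumOfFieldLengths(f) / Idx.getDocCount(f)
                       for f in self.FIELDS}

    def initializeForQuery(self, queryStems):
        # Query-constant values: recompute only when the query changes.
        self.rsj = {}
        for f in self.FIELDS:
            N = Idx.getDocCount(f)
            for stem in queryStems:
                df = Idx.getDocFreq(f, stem)          # assumed accessor
                self.rsj[(f, stem)] = max(0.0, math.log((N - df + 0.5) / (df + 0.5)))

    def featureBM25(self, queryStems, field, termVector, docLen):
        # Document-specific part: loop over the query stems (as in the sketch above),
        # reading the cached self.avgLen[field] and self.rsj[(field, stem)] values.
        ...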
Your system will use files to communicate with the machine learning toolkits. It will write a feature vector file to describe <q,d> pairs to the toolkit, and it will read a file of <q,d> scores to rerank documents.
Both toolkits use the same feature vector file format. Each line contains the feature vector for one <q, d> pair. Several example records are below.
2 qid:1 1:1 2:1 3:0 4:0.2 5:0 # clueweb09-en0000-48-24794
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # clueweb09-en0011-93-16495
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # clueweb09-en0132-64-99898
0 qid:1 1:0 2:0 3:1 4:0.3 5:0 # clueweb09-en0370-54-24993
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # clueweb09-en2108-01-81090
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # clueweb09-en0760-49-43194
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # clueweb09-en6303-47-69892
0 qid:2 1:0 2:0 3:1 4:0.2 5:0 # clueweb09-en6373-93-83391
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # clueweb09-en2049-70-10797
1 qid:3 1:1 2:1 3:0 4:0.3 5:0 # clueweb09-en1703-12-05495
1 qid:3 1:1 2:0 3:0 4:0.4 5:1 # clueweb09-en8278-44-45799
0 qid:3 1:0 2:1 3:1 4:0.5 5:0 # clueweb09-en9313-84-41193
The first column is the score or target value of the <q, d> pair. In a training file, use the relevance value obtained for this <q, d> pair from the relevance judgments ("qrels") file. In a test file, this value should be 0.
The second column is the query id.
The next n columns are feature_id:feature_value pairs, where n is the number of features. Feature ids must be integers starting with 1. Feature values may be integers or floats. Features must appear in canonical order (1 first, 2 next, etc.). If a feature does not apply to a particular <q, d> pair (e.g., the document does not have an inlink field), SVMrank allows you to omit it (sparse representation), but RankLib does not (complete representation).
A '#' character indicates a comment. The '#' character and everything after it until the end of the line is ignored by the toolkit. However, the grading software uses the comment field. Each line must end with the external document id.
The sort key is column 2 (ascending numeric order for the query id portion of the field).
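For example, a small helper along the following lines writes one record in this format; the function name and the use of a dict keyed by feature id are just one convenient choice, not a requirement.

def writeFeatureVector(outFile, target, qid, features, externalDocid):
    # features: dict mapping feature id (int, starting at 1) -> value.
    # Sorting the ids produces the required canonical order; omitted ids
    # give a sparse representation (SVMrank only).
    pairs = ' '.join(f'{fid}:{features[fid]:g}' for fid in sorted(features))
    outFile.write(f'{target} qid:{qid} {pairs} # {externalDocid}\n')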
When your software uses a toolkit to calculate new scores for <q,d> pairs, the toolkit writes the new scores to a file. Re-rank the documents based on these new scores. Higher scores are better.
SVMrank's svm_rank_classify writes one score per line, as shown below. The scores are listed in the same order as the feature vector file, above.

1.48250444
1.70125248
1.15657135
RankLib writes one line per document. Each line contains a query id, document rank, and score, separated by tab characters, as shown below. The scores are listed in the same order as the feature vector file, above.

059   0   1.48250444
059   1   1.70125248
059   2   1.15657135
:     :   :
102   1   0.992092250788071
:     :   :
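A sketch of the reranking step is shown below. It assumes you kept a record (here, the hypothetical docidsByQuery) of the external document ids in the same order that their feature vectors were written; the score is taken from the last whitespace-separated field of each line, which works for either score file format shown above.

def rerank(scoreFilePath, docidsByQuery):
    # docidsByQuery: {qid: [external doc ids]}, in feature-vector-file order.
    with open(scoreFilePath) as f:
        scores = [float(line.split()[-1]) for line in f if line.strip()]
    reranked = {}
    i = 0
    for qid, docids in docidsByQuery.items():
        scored = list(zip(docids, scores[i:i + len(docids)]))
        i += len(docids)
        scored.sort(key=lambda pair: pair[1], reverse=True)   # higher scores are better
        reranked[qid] = scored
    return reranked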
When generating field-specific features (e.g., f8, the BM25 score for <q, d_title>), your software may encounter documents that do not have that field (the index returns None) or that have an empty field (the field length is 0). Your software should not generate a feature for these documents.
If your software generates a value of 0 for these cases, it will not match the reference software.
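In code, the guard might look like the following sketch; featureBM25 and the features dict stand in for your own feature machinery, with extra arguments omitted.

termVector = Idx.getTermVector(docid, 'title')       # may return None or an empty TermVector
if termVector is None or termVector.stemsLength() == 0:
    pass    # missing or empty field: do not generate f8 for this <q, d> pair
else:
    features[8] = featureBM25(queryStems, termVector)  # your term-vector BM25 code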
The most convenient software architecture creates a feature vector, writes it to disk, creates the next feature vector, writes it to disk, and so on. Unfortunately, SVMrank is more effective if your features are all normalized to be in the same range, for example [0..1]. Your SVMrank features must be normalized to [0..1].
Your software does not know the range of query likelihood, BM25, PageRank, and other features in advance, thus it cannot normalize feature vectors incrementally. Instead, it must wait, and do normalization after all feature vectors for a particular query are created.
To normalize a feature (e.g., f5) for a specific query, identify the maximum and minimum values for that feature, and then do standard [0..1] normalization. For example, the normalized value for feature f5 (q3, d21) is:
(featureValue_f5 (q3, d21) - minValue_f5 (q3)) / (maxValue_f5 (q3) - minValue_f5 (q3)).
featureValue_f5 (q3, d21): The value of feature f5 for query q3 and document d21.
minValue_f5 (q3): The minimum value of feature f5 across all feature vectors for query q3.
maxValue_f5 (q3): The maximum value of feature f5 across all feature vectors for query q3.
If the min and max are the same value, set the feature value to 0.
Thus, each feature is normalized separately for each query.
Beware: When a feature does not exist for a document (e.g., the document doesn't have a PageRank score, or an inlink field), you may want to set the feature value to zero. However, be sure that you set it to zero after normalization. If you set the value before normalization, you may be inadvertently creating a new minimum or maximum value, which will produce incorrect results.
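Putting these rules together, a per-query normalization routine might look like the sketch below. It assumes the feature vectors for one query were collected as a list of dicts (feature id -> raw value), with inapplicable features simply absent until this step.

def normalizeQueryFeatures(featureVectors, featureIds):
    # featureVectors: list of dicts (feature id -> raw value) for ONE query.
    # Normalizes each feature to [0..1] in place, then fills in 0 for
    # missing features after normalization.
    for fid in featureIds:
        present = [fv[fid] for fv in featureVectors if fid in fv]
        lo = min(present) if present else 0.0
        hi = max(present) if present else 0.0
        for fv in featureVectors:
            if fid not in fv or hi == lo:
                fv[fid] = 0.0                      # missing feature, or min == max
            else:
                fv[fid] = (fv[fid] - lo) / (hi - lo)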
It is not necessary to normalize features for RankLib. If you want features to be normalized, do not do it yourself. Instead, your software may use a parameter to specify the type of normalization that RankLib should do.
This assignment uses two machine learning toolkits that are called in different ways.
SVMrank uses SVM and pairwise training to train a model. It consists of two C++ applications: svm_rank_learn and svm_rank_classify. Executables for Mac, Linux, Windows, and Windows with Cygwin are provided, or you can compile your own version from the source code. The Linux executable requires a recent version of gcc.
Call it from your Python software as shown below.
import subprocess

# Train a model
output = subprocess.check_output(
    '<ltr:svmRankLearnPath> -c <ltr:svmRankParamC> <ltr:trainingFeatureVectorsFile> <ltr:modelFile>',
    stderr=subprocess.STDOUT,
    shell=True).decode('UTF-8')

# Generate new ranking scores
output = subprocess.check_output(
    '<ltr:svmClassifyPath> <ltr:testingFeatureVectorsFile> <ltr:modelFile> <ltr:testingDocumentScores>',
    stderr=subprocess.STDOUT,
    shell=True).decode('UTF-8')
RankLib provides several pairwise and listwise algorithms for training models. The LIB_DIR created for HW1 contains a jar file that can be invoked as an external application (as above) or called directly (more efficient, fewer problems).
Call it from your Python software as shown below.

# Train a model
PyLu.RankLib.main([
    '-train', <ltr:trainingFeatureVectorsFile>,
    '-ranker', <ltr:RankLib:model>,
    '-save', <ltr:modelFile>,
    '-metric2t', <ltr:RankLib:metric2t>])  # optional parameter

# Generate new ranking scores
PyLu.RankLib.main([
    '-rank', <ltr:testingFeatureVectorsFile>,
    '-load', <ltr:modelFile>,
    '-score', <ltr:testingDocumentScores>])
See the RankLib documentation for information about other parameters, including the parameters needed to produce scores for <q,d> pairs.
HW3 has many moving pieces, but each of them is fairly simple. We recommend the following development path.
Make sure that you can run svm_rank_learn.exe and svm_rank_classify.exe from the command line. Typically, this is a quick check for Windows and Linux users, but Mac people need to get the permissions right. (See the FAQ.)
Start with the simplest LTR configuration. Focus on test cases that use .inRank files and the RankLib toolkit, because this avoids problems with your initial ranking and with feature normalization. Features f1-f4 are the easiest, so do them first. Then do the overlap features, then the BM25 features, and then the query likelihood features.
The grade consists of two components: i) a grade for correct feature values in your .LtrTrain file, and ii) a grade for a correct reranking. Focus first on the feature values. After they are correct, focus on the reranking grade. The reranking grade depends on the initial ranking (is the .inRank file correct?), the learned model (your features are presumably correct now, so make sure your parameters are correct), correct communication with the toolkit, and reranking by the new scores.
When the .inRank + RankLib configuration works properly, focus on the test cases that use your own initial ranking. If your HW1 system worked properly, there shouldn't be much work here.
Finally, focus on the SVMrank cases, which require feature normalization. See the hints about common feature normalization problems. (Mac people may also need to sort out permissions issues. See the FAQ.) By this point, you are pretty sure that the unnormalized feature values are correct, so any problems are probably caused by not setting the min & max values properly.
If you have questions not answered here, see the HW3 FAQ and the Homework Testing FAQ.
Copyright 2024, Carnegie Mellon University.
Updated on November 06, 2024