Software Architecture
Efficiency
File Formats
Normalizing Feature Values
Using the Toolkits
Implementation Order
You could reuse your existing Indri and BM25 implementations to do this assignment, but your software architecture would be complicated and slow. It is faster, simpler, and more common to develop new implementations that are designed for reranking. Your software is only tested with unstructured queries, so much of what made HW1-HW2 complicated (query operators, default beliefs, structured queries) won't occur in this homework.
Each stage of the ranking pipeline processes a set of queries, which may seem unintuitive. For example, if the .qry file contains three queries, the initial ranker (e.g., BM25) ranks all of them and then passes the results to the learning-to-rank stage to rerank all of them. This architecture is a little simpler to build and faster to run than the more obvious query-at-a-time architecture.
Your LTR reranking software should be structured more-or-less as follows:
// generate training data
while a training query q is available:
    use QryParser.tokenizeString to stop & stem the query
    foreach document d in the relevance judgments for training query q:
        use Idx.getTermVector to get d's term vector
        generate a feature vector for <q, d>
    if toolkit is SVMrank:
        normalize the feature values for query q to [0..1]
    write the feature vectors to file

// train a model
call a toolkit to train a model

// generate testing data for top n documents in initial BM25 ranking
while a test query q is available:
    use QryParser.tokenizeQuery to stop & stem the query
    foreach document d in the top n of the initial ranking for q:
        use Idx.getTermVector to get d's term vector
        generate a feature vector for <q, d>
    if toolkit is SVMrank:
        normalize the feature values for query q to [0..1]
    write the feature vectors to file

// re-rank test data (all queries)
call the toolkit to produce scores for the test data
read the new scores and use them to re-rank the initial rankings
write the re-ranked results in trec_eval format
Feature vectors are generated more-or-less as follows.
// generate feature vector
create an empty feature vector
use Idx.getAttribute to get the PageRank, spam, and Wikipedia features for d from the index
fetch the term vector for d
calculate other features for <q, d>
The main advantage of this implementation is its efficiency and simplicity. Its main disadvantage is that you must reimplement BM25 and Indri to work from a term vector data structure. Assume BOW queries (no query operators), which simplifies implementation considerably. Pseudo code is shown below for a BM25 implementation that uses term vectors.
// tokenize the query before calling featureBM25
queryStems = QryParser.tokenizeQuery (query)

// generate a BM25 feature
def featureBM25 (qid, queryStems, termVector):
    score = 0
    for stem in queryStems:
        if stem occurs in the field in termVector:
            score += BM25 term score for stem
    return score
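To make the pseudo code concrete, here is one possible Python sketch. It is a sketch only: the term vector is simplified to a dict mapping stems to term frequencies, and the statistics (doc_len, avg_doc_len, n_docs, df) are passed in as plain values. Your own feature generator will read these from the index and from cached query-constant values instead.

```python
import math

def feature_bm25(query_stems, term_vector, doc_len, avg_doc_len,
                 n_docs, df, k1=1.2, b=0.75):
    """BM25 score for one document field, computed from a term-vector-like
    structure.  term_vector: dict stem -> tf in this field.  df: dict
    stem -> document frequency.  Only query stems that occur in the field
    contribute to the score."""
    score = 0.0
    for stem in query_stems:
        if stem not in term_vector:
            continue                     # stem does not occur in this field
        tf = term_vector[stem]
        # RSJ (idf) weight, floored at 0 so rare-term noise cannot go negative
        rsj = max(0.0, math.log((n_docs - df[stem] + 0.5) / (df[stem] + 0.5)))
        # tf weight with document-length normalization
        tf_weight = tf / (tf + k1 * ((1 - b) + b * doc_len / avg_doc_len))
        score += rsj * tf_weight
    return score
```

Because the loop is over query stems rather than document stems, a document that shares no vocabulary with the query naturally gets a score of 0.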
For the Indri retrieval model features, if a field does not match any term of a query, the score for the field is 0. This is consistent with what your HW1-HW2 Indri implementation did - documents that have no terms in common with the query were not given a score.
Note: If you try to instantiate a TermVector for a document field that does not exist (e.g., an inlink field for a document that has no inlinks), the constructor returns an empty TermVector. It is easy to recognize an empty TermVector: The positionsLength and stemsLength methods will return 0 (i.e., the field does not contain anything).
Your HW2 system probably cached constant values (for example, the average title field length) and values that are constant for a particular query (for example, the RSJ or J-M score for "apple" in the "body" field). It was natural for the QrySopScore operator to cache those values for the BM25 and Indri retrieval models.
Caching constant and query-constant values is even more important for LTR rerankers. You have freedom in how you implement this capability in your LTR reranker. One approach is to cache constant and query-constant values when the feature generator is initialized (or reinitialized for a particular query).
Caching can become a distraction and a source of frustrating errors. Think about caching first, and aim for simplicity. The training queries and test queries define the vocabulary that benefits from caching. The active features define the values that might benefit from caching. For example, only cache the RSJ value for "apple.title" if you are using feature f8. Think of retrieval model scoring as divided into two parts: i) document-independent, and ii) document-specific.
Pro tip: Think about caching right at the start when you are building your first retrieval model feature. Your architecture will be simpler and cleaner. If you wait until you have a working (but slow) system, you will do twice as much work, and you are more likely to have little bugs that waste your time.
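As one illustration of the two-level split (not the required design), the cache might look like this in Python. The class and method names, and the df_lookup accessor, are hypothetical stand-ins for your own index API; the RSJ weight serves as the example query-constant value.

```python
import math

class FeatureCache:
    """Two-level cache sketch: values that never change (constants) and
    values that change once per query (query-constant values)."""

    def __init__(self, n_docs, avg_field_len):
        # document-independent constants, computed once at startup
        self.n_docs = n_docs
        self.avg_field_len = avg_field_len   # e.g. {"body": 320.5, "title": 5.1}
        self._rsj = {}                       # query-constant cache

    def begin_query(self, query_stems, df_lookup, fields=("body", "title")):
        """Recompute query-constant values (here: RSJ weights) once per
        query.  df_lookup(stem, field) returns the document frequency."""
        self._rsj.clear()
        for stem in query_stems:
            for field in fields:
                df = df_lookup(stem, field)
                self._rsj[(stem, field)] = max(0.0, math.log(
                    (self.n_docs - df + 0.5) / (df + 0.5)))

    def rsj(self, stem, field):
        # stems never seen for this query contribute 0
        return self._rsj.get((stem, field), 0.0)
```

The document-specific part of scoring (tf weights, document lengths) then stays in the per-document feature code, which only reads from the cache.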
Your system will use files to communicate with the machine learning toolkits. It will write a feature vector file to describe <q,d> pairs to the toolkit, and it will read a file of <q,d> scores to rerank documents.
Both toolkits use the same feature vector file format. Each line contains the feature vector for one <q, d> pair. Several example records are below.
2 qid:1 1:1 2:1 3:0 4:0.2 5:0 # clueweb09-en0000-48-24794
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # clueweb09-en0011-93-16495
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # clueweb09-en0132-64-99898
0 qid:1 1:0 2:0 3:1 4:0.3 5:0 # clueweb09-en0370-54-24993
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # clueweb09-en2108-01-81090
2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # clueweb09-en0760-49-43194
1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # clueweb09-en6303-47-69892
0 qid:2 1:0 2:0 3:1 4:0.2 5:0 # clueweb09-en6373-93-83391
2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # clueweb09-en2049-70-10797
1 qid:3 1:1 2:1 3:0 4:0.3 5:0 # clueweb09-en1703-12-05495
1 qid:3 1:1 2:0 3:0 4:0.4 5:1 # clueweb09-en8278-44-45799
0 qid:3 1:0 2:1 3:1 4:0.5 5:0 # clueweb09-en9313-84-41193
The first column is the score or target value of the <q, d> pair. In a training file, use the relevance value obtained for this <q, d> pair from the relevance judgments ("qrels") file. In a test file, this value should be 0.
The second column is the query id.
The next n columns are feature_id:feature_value pairs, where n is the number of features. Feature ids must be integers starting with 1. Feature values may be integers or floats. Features must be in canonical order (1 first, 2 next, etc.). If a feature does not apply to a particular <q, d> pair (e.g., the document does not have an inlink field), SVMrank allows you to omit it (a sparse representation), but RankLib does not (a complete representation).
A '#' character indicates a comment. The '#' character and everything after it until the end of the line is ignored by the toolkit. However, the grading software uses the comment field. Each line must end with the external document id.
The sort key is column 2 (ascending numeric order for the query id portion of the field).
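A small Python helper can enforce these formatting rules (canonical feature order, a comment field that ends with the external document id, sparse vs. complete representation). The function name and the use of None to mark a missing feature are illustrative choices, not part of the assignment.

```python
def format_feature_line(target, qid, features, external_doc_id, sparse=False):
    """Format one line of the feature vector file.

    features: dict mapping feature id (int, starting at 1) to value.
    A value of None means the feature does not apply to this <q, d> pair:
    with sparse=True (SVMrank only) it is omitted; otherwise it is
    written as 0 (RankLib needs a complete representation)."""
    parts = [str(target), f"qid:{qid}"]
    for fid in sorted(features):             # canonical order: 1, 2, 3, ...
        value = features[fid]
        if value is None:
            if sparse:
                continue                     # sparse: omit missing features
            value = 0                        # complete: write 0
        parts.append(f"{fid}:{value:g}")
    parts.append(f"# {external_doc_id}")     # comment ends with external id
    return " ".join(parts)
```

For example, format_feature_line(2, 1, {1: 1, 2: 0.2, 3: None}, "clueweb09-en0000-48-24794") produces a line in the same shape as the example records above.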
When your software uses a toolkit to calculate new scores for <q,d> pairs, the toolkit writes the new scores to a file. Re-rank the documents based on these new scores. Higher scores are better.
SVMrank's score file contains one score per line, as shown below. The scores are listed in the same order as the feature vector file, above.
1.48250444
1.70125248
1.15657135
Each line of RankLib's score file contains a query id, document rank, and score, separated by tab characters, as shown below. The scores are listed in the same order as the feature vector file, above.
059 0 1.48250444
059 1 1.70125248
059 2 1.15657135
 :  :  :
102 1 0.992092250788071
 :  :  :
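However your software parses the score file, reranking then reduces to sorting each query's documents by the new scores. The sketch below assumes the scores have already been read into a list that parallels the <q, d> pairs in the feature vector file; the function and variable names are illustrative.

```python
from collections import defaultdict

def rerank(qd_pairs, scores):
    """Re-rank documents by the toolkit's new scores.

    qd_pairs: list of (qid, external_doc_id) in feature-vector-file order.
    scores:   list of new scores in that same order.
    Returns {qid: [external_doc_id, ...]} sorted best-first."""
    by_query = defaultdict(list)
    for (qid, docid), score in zip(qd_pairs, scores):
        by_query[qid].append((score, docid))
    rankings = {}
    for qid, docs in by_query.items():
        docs.sort(key=lambda sd: -sd[0])     # higher scores are better
        rankings[qid] = [docid for _, docid in docs]
    return rankings
```

The reranked lists can then be written out in trec_eval format in the usual way.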
The most convenient software architecture creates a feature vector, writes it to disk, creates the next feature vector, writes it to disk, and so on. Unfortunately, SVMrank is more effective if your features are all normalized to be in the same range, for example [0..1]. Your SVMrank features must be normalized to [0..1].
Your software does not know the range of Indri, BM25, PageRank, and other features in advance, thus it cannot normalize feature vectors incrementally. Instead, it must wait, and do normalization after all feature vectors for a particular query are created.
To normalize a feature (e.g., f5) for a specific query, identify the maximum and minimum values for that feature, and then do standard [0..1] normalization. For example, the normalized value for feature f5 (q3, d21) is:
(featureValue_f5 (q3, d21) - minValue_f5 (q3)) / (maxValue_f5 (q3) - minValue_f5 (q3)).
featureValue_f5 (q3, d21): The value of feature f5 for query q3 and document d21.
minValue_f5 (q3): The minimum value of feature f5 across all feature vectors for query q3.
maxValue_f5 (q3): The maximum value of feature f5 across all feature vectors for query q3.
If the min and max are the same value, set the feature value to 0.
Thus, each feature is normalized separately and for each query.
Beware: When a feature does not exist for a document (e.g., the document doesn't have a PageRank score, or an inlink field), you may want to set the feature value to zero. However, be sure that you set it to zero after normalization. If you set the value before normalization, you may be inadvertently creating a new minimum or maximum value, which will produce incorrect results.
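Putting these rules together, per-query normalization might be sketched in Python as follows. Here each feature vector is a dict from feature id to value, and None marks a feature that does not exist for a document; both representation choices are illustrative, not required.

```python
def normalize_query_features(vectors):
    """Min-max normalize each feature to [0..1] across one query's
    feature vectors (a list of dicts {feature_id: value}).  None marks a
    feature that does not exist for a document; it is set to 0 AFTER
    normalization, so it never distorts the min or max."""
    feature_ids = {fid for v in vectors for fid in v if v[fid] is not None}
    for fid in feature_ids:
        present = [v[fid] for v in vectors if v.get(fid) is not None]
        lo, hi = min(present), max(present)
        for v in vectors:
            if v.get(fid) is None:
                v[fid] = 0.0                 # missing: zero after normalization
            elif hi == lo:
                v[fid] = 0.0                 # min == max: set to 0
            else:
                v[fid] = (v[fid] - lo) / (hi - lo)
    return vectors
```

Note that the min and max are computed only over the values that actually exist, which is exactly the point of the warning above.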
It is not necessary to normalize features for RankLib. If you want features to be normalized, do not do it yourself. Instead, your software may use a parameter to specify the type of normalization that RankLib should do.
This assignment uses two machine learning toolkits that are called in different ways.
SVMrank uses SVM and pairwise training to train a model. It consists of two C++ software applications: svm_rank_learn and svm_rank_classify. Executables for Mac, Linux, Windows, and Windows with Cygwin are provided, or you can compile your own version from the source code. The Linux executable requires a recent version of gcc.
Call it from your Python software as shown below.
# Train a model
output = subprocess.check_output(
    '<ltr:svmRankLearnPath> -c <ltr:svmRankParamC> <ltr:trainingFeatureVectorsFile> <ltr:modelFile>',
    stderr=subprocess.STDOUT,
    shell=True).decode('UTF-8')

# Generate new ranking scores
output = subprocess.check_output(
    '<ltr:svmClassifyPath> <ltr:testingFeatureVectorsFile> <ltr:modelFile> <ltr:testingDocumentScores>',
    stderr=subprocess.STDOUT,
    shell=True).decode('UTF-8')
RankLib provides several pairwise and listwise algorithms for training models. A jar file is provided that can be invoked as an external application (as above) or called directly (more efficient, fewer problems).
Call it from your Python software as shown below.

# Train a model
PyLu.RankLib.main([
    '-train', <ltr:trainingFeatureVectorsFile>,
    '-ranker', <ltr:RankLib:model>,
    '-save', <ltr:modelFile>,
    '-metric2t', <ltr:RankLib:metric2t>])  # optional parameter

# Generate new ranking scores
PyLu.RankLib.main([
    '-rank', <ltr:testingFeatureVectorsFile>,
    '-load', <ltr:modelFile>,
    '-score', <ltr:testingDocumentScores>])
See the RankLib documentation for information about other parameters, including the parameters needed to produce scores for <q,d> pairs.
HW3 has many moving pieces, but each of them is fairly simple. We recommend the following development path.
Start with the simplest LTR configuration. Focus on test cases that use .inRank files and the RankLib toolkit, because this avoids problems in your initial ranking or feature normalization. Features f1-f4 are the easiest, so do them first. Then do the overlap features, then the BM25 features, then the Indri features.
The grade consists of two components: i) a grade for correct feature values in your .LtrTrain file, and ii) a grade for a correct reranking. Focus first on the feature values. After they are correct, focus on the reranking grade. The reranking grade depends on the initial ranking (a correct .inRank file), the learned model (your features are presumably correct now, so make sure your parameters are correct), communicating correctly with the toolkit, and reranking by the new scores.
When the .inRank + RankLib configuration works properly, focus on the test cases that use your initial ranking. If your HW2 system worked properly, there shouldn't be much work here.
Finally, focus on the svmRank cases, which require feature normalization. See the hints about common feature normalization problems. (Mac people may also need to sort out permissions issues. See the FAQ.) By this point, you are pretty sure that the unnormalized feature values are correct, so any problems are probably caused by not setting the min & max values properly.
If you have questions not answered here, see the HW3 FAQ and the Homework Testing FAQ.
Copyright 2024, Carnegie Mellon University.
Updated on February 14, 2024