HW2: Query Rewriting with PRF
Due Sep 29, 11:59pm ET
Assignment Overview
The purpose of this assignment is to gain experience with the TermVector index and query rewriting using pseudo relevance feedback (PRF).
- Add query rewriting with pseudo relevance feedback to the QryEval search engine.
- Conduct experiments with your search engine.
- Write a report about your work.
- Upload your work to the course website for grading.
1. New Retrieval Capabilities
Query rewriting with pseudo relevance feedback requires that you add three new capabilities to your search engine: i) a Rewriter class, ii) the Okapi pseudo relevance feedback algorithm, and iii) the query likelihood (Indri, RM3) relevance feedback algorithm. Capabilities that you expect to be general should be implemented in the Rewriter class. Capabilities that you expect to be specific to a particular algorithm should be implemented in a class for that algorithm (e.g., RewriteWithPrf).
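The sketch below illustrates one way the general/specific split might look. It is only a sketch, not the required design: the Design Guide specifies the actual class structure, and the method name rewrite, its signature, and the ScoreList type (standing in for whatever initial-ranking representation your engine uses) are assumptions made for illustration.

    // Illustrative sketch only; follow the Design Guide for the required design.
    import java.io.IOException;

    // Rewriter.java: behavior shared by all rewriters, e.g., reading prf:*
    // parameters, combining the original and learned queries, and writing the
    // learned query to the .qryOut file.
    public abstract class Rewriter {
      public abstract String rewrite(String originalQuery, ScoreList initialRanking)
          throws IOException;
    }

    // RewriteWithPrf.java (separate file): behavior specific to one PRF
    // algorithm, e.g., how candidate expansion terms are scored.
    public class RewriteWithPrf extends Rewriter {
      @Override
      public String rewrite(String originalQuery, ScoreList initialRanking)
          throws IOException {
        // 1. Fetch the prf:expansionField term vectors of the top-ranked documents.
        // 2. Score candidate terms and keep the best ones as the learned query.
        // 3. Combine with the original query (#WSUM) if prf:rm3:origWeight is set.
        return originalQuery;  // placeholder
      }
    }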
Query rewriting with pseudo relevance feedback must meet the following requirements.
- It must use Okapi or query likelihood pseudo relevance feedback (PRF) to rewrite queries (i.e., replace the original query with a new query).
- The PRF algorithm must use text from the field identified by the prf:expansionField parameter and write a query for that same field.
- If the prf:rm3:origWeight parameter is specified, the expanded query must be a weighted combination (#WSUM) of the original query and the learned query created by PRF. If the parameter is not specified, the expanded query must be the learned query. (See the sketch after this list.)
- The learned query must be written to a .qryOut file for grading.
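A minimal sketch of the origWeight logic follows, assuming the parameters are available as a Map<String, String>. The class name PrfQueryCombiner, the weight convention, and the exact #WSUM formatting are illustrative assumptions; use the syntax that your query parser and the Design Guide require.

    import java.util.Map;

    // Hypothetical helper; the #WSUM formatting shown is an assumption.
    public class PrfQueryCombiner {
      public static String combine(String originalQuery, String learnedQuery,
                                   Map<String, String> parameters) {
        if (parameters.containsKey("prf:rm3:origWeight")) {
          double w = Double.parseDouble(parameters.get("prf:rm3:origWeight"));
          // Weighted combination of the original query and the learned query.
          return String.format("#WSUM(%.4f %s %.4f %s)",
              w, originalQuery, 1.0 - w, learnedQuery);
        }
        // No prf:rm3:origWeight parameter: use the learned query alone.
        return learnedQuery;
      }
    }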
See the Design Guide for details.
Your source code must run within the homework testing service.
2. Experiments
Conduct three experiments that investigate the effects of query rewriting with pseudo relevance feedback on retrieval accuracy and efficiency, as described below. HW2 uses a subset of the HW1 bag-of-words queries to help your experiments run a little faster.
This assignment introduces a win:tie:loss metric. Report the number of queries that win, tie, or lose an experiment, as measured by the relative MAP difference from a baseline:

(MAP_experiment - MAP_baseline) / MAP_baseline

A query wins if the relative difference is ≥ 2.0% and loses if it is ≤ -2.0%. Otherwise, it is a tie.
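A small sketch of the counting logic is shown below, assuming you have a per-query MAP (average precision) value for the baseline run and the experimental run, keyed by query id. The class and method names are hypothetical; the 2.0% thresholds come from the rule above.

    import java.util.Map;

    public class WinTieLoss {
      /** Count wins, ties, and losses using the 2.0% relative-difference rule. */
      public static int[] count(Map<String, Double> baselineAp,
                                Map<String, Double> experimentAp) {
        int wins = 0, ties = 0, losses = 0;
        for (Map.Entry<String, Double> e : baselineAp.entrySet()) {
          double base = e.getValue();                        // assumes baseline AP > 0
          double exp = experimentAp.getOrDefault(e.getKey(), 0.0);
          double relDiff = (exp - base) / base;              // relative MAP difference
          if (relDiff >= 0.02) wins++;
          else if (relDiff <= -0.02) losses++;
          else ties++;
        }
        return new int[] {wins, ties, losses};
      }
    }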
2.1. Experiment 1: Number of Feedback Documents
Examine the effects of varying the number of feedback documents on pseudo relevance feedback. The baseline algorithm is BM25. For each PRF algorithm, the experimental systems use the top 10, 20, 30, and 50 documents for relevance feedback to produce a new query with 10 terms.
2.2. Experiment 2: Number of Feedback Terms
Examine the interaction between the number of feedback documents and feedback terms on ranking quality. Examine six configurations: {10, 20, 30} documents × {20, 30} terms for each PRF algorithm.
2.3. Experiment 3: Exploration
The third experiment is of your own design. Experiments 1 and 2 taught you a little about PRF. Now, develop one or two hypotheses about PRF that interest you. Create and test five custom PRF configurations that investigate your hypotheses. You may use one or both PRF algorithms for this experiment.
Grading of this experiment is based more on the quality of your ideas than the quality of your results. It is okay for your hypotheses to fail as long as they are interesting and show an understanding of the task.
2.4. Advice
Parameter files: As suggested for HW1, you may find it faster and more accurate to use a .csv file and a script to generate .param files automatically. There are 26 experiments. That's a lot of .param files to write manually.
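For example, a small generator along these lines can write one .param file per CSV row. Only prf:expansionField appears in this assignment's text; queryFilePath is a typical QryEval parameter, the field value "body" is a placeholder, and prf:numDocs and prf:numTerms are hypothetical names. Use whatever parameter names the Design Guide specifies.

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class MakeParams {
      public static void main(String[] args) throws IOException {
        // Each row of experiments.csv: experimentId,numFeedbackDocs,numFeedbackTerms
        List<String> rows = Files.readAllLines(Paths.get("experiments.csv"));
        for (String row : rows) {
          String[] f = row.split(",");
          try (PrintWriter out = new PrintWriter("HW2-Exp-" + f[0] + ".param")) {
            out.println("queryFilePath=HW2-Exp-" + f[0] + ".qry");  // typical QryEval parameter
            out.println("prf:expansionField=body");                 // field value is an assumption
            out.println("prf:numDocs=" + f[1]);                     // hypothetical parameter name
            out.println("prf:numTerms=" + f[2]);                     // hypothetical parameter name
          }
        }
      }
    }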
.inRank files: Most of your experiments use the same BM25 initial ranking. Create it once, save it to an .inRank file, and reuse it to make your experiments run more quickly. We strongly recommend that you use .inRank files for your experiments.
Automation: Each of the experiments may take several minutes to run. The full set of experiments may take 2-3 hours. We use a script that runs experiments in the background while we do other tasks. You may find it productive to do something similar.
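One possible driver, sketched below, runs QryEval once per .param file and captures each run's output in a log file so a batch of experiments can run unattended. The command line (classpath, main class name) and the .param file names are assumptions; adjust them to your own build.

    import java.io.File;
    import java.io.IOException;

    public class RunExperiments {
      public static void main(String[] args) throws IOException, InterruptedException {
        String[] paramFiles = {"HW2-Exp-1a.param", "HW2-Exp-1b.param"};  // hypothetical names
        for (String param : paramFiles) {
          ProcessBuilder pb = new ProcessBuilder("java", "-cp", "bin:lib/*", "QryEval", param);
          pb.redirectErrorStream(true);                 // merge stderr into the log
          pb.redirectOutput(new File(param + ".log"));  // one log file per experiment
          pb.start().waitFor();                         // run experiments back to back
        }
      }
    }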
3. The Report
11-442 students must submit a one-page report that contains a statement of collaboration and originality. A template is provided in Microsoft Word and PDF formats. The report must follow the structure provided in the template.
11-642 students must write a report that describes their work and their analysis of the experimental results. A report template is provided in Microsoft Word and PDF formats. The report must follow the structure provided in the template.
See the grading information document for information about how experiments and reports are graded.
4. Submission Instructions
Create a .zip file that contains your software, following the same requirements used for interim software submissions. Name your report yourAndrewID-HW2-Report.pdf and place it in the same directory that contains your software (e.g., the directory that contains QryEval.java).
Reproducibility: Your submission must include files in the QryEval directory that enable all of your experiments to be reproduced. They must follow a naming convention so that the homework testing service can find them. The naming convention is approximately <hwid>-Exp-<experiment id>.<filetype>, for example, HW2-Exp-3.1a.qry and HW2-Exp-3.1a.param. See the report template (above) for guidance about how to name the files for each experiment.
Submit your homework by checking the "Final Submission" box in the homework testing service. We will run a complete set of tests on your software, so you do not need to select tests to run. If you make several final submissions, we will grade your last submission.
FAQ
If you have questions not answered here, see the HW2 FAQ and the Homework Testing FAQ.