In this assignment, you will be developing the PageRank algorithm and its personalized variants and compare them empirically on the provided CiteEval dataset.
This assignment consists of 3 major sections:
NOTE: This assignment is to be submitted in two parts. The first part, consisting of items A and B above, is due on November 9, 2009. The second part, consisting of item C above, is due on November 16, 2009. Allocate your time accordingly.
Both parts will require submission of the corresponding report. It is important to do a good job on your report, so please leave enough time to write the report for each part. Details for each of the deliverables follow.
You will be implementing
1) Global PageRank (GPR) described on slide 16
2) Query-based Topic Sensitive PageRank (QTSPR) as described on slide 6
3) Peronalized Topic Sensitive PageRank (PTSPR). This is a personalized variant of 2) above. Instead of weighing the topic-sensitive pagerank for each topic with the query-topic distribution, we weigh it with the user's interest in that particular topic. i.e. P(j|q) is replaced by P(j|u).
The transition matrix of the CiteEval documents can be downloaded here. This document is in the sparse matrix format. i.e. each row denotes information for a cell. e.g. a row of the form "i j k" denotes that the matrix contains a value k at the row i column j. The value k=1 denotes that there is a link from document i to document j.
For the TSPR-based algorithms, you will also need the document classification information. This can be downloaded here. In this file, each row of the form "x y" denotes that document x belongs to the y category.
For personalization, you will need each user's topical interest distribution P(j|u) for all topics. This can be downloaded here. Each row is of the form:
"a b 1:x 2:y ........ 12:z". a represents the user, b represents the query by the user, and the items of the form 1:x denote that the P(1 | a) = x. In other words, each component in this line is p(j|u) for that user
For QTSPR, you will need each query's topical distribution P(j|q) for all topics. This can be downloaded here. Each row is of the form:
"a b 1:x 2:y ........ 12:z". a represents the user, b represents the query by the user, and the items of the form 1:x denote that the P(1 | b) = x. In other words, each component in this line is p(j|q) for that query
You may use any matrix multiplication library suitable for your programming language of choice or implement your own matrix multiplication routine.
Needless to say, you must strive to modularize your code to share the common components of the various PageRank algorithms.That will significantly reduce the programming effort.
For retrieval, you will be implementing variants of the Google's approach to combination of PageRank and search-relevance scores. (slide 21). One simple function is the weighted sum of the PageRank and search-relevance scores. Let's call this WS for weighted sum. In addition to this function, you should also implement a custom way of combining PageRank and search-relevance scores. Let's call this CM for custom method. Additionally, you may refrain from using a combination of scores, and just use the PageRank scores to rank the documents. Let's call this NS for No-search as query-specific search results are not included.
The search-relevance scores of each query can be downloaded here. This zipped archive contains several files, (one per query) each containing the ranked-list returned by the Indri search engine for a particular query, with the corresponding retrieval scores. The document format is similar to the trec_eval format that you have used in the first two assignments.
Based on the provided dataset, you will be comparing nine approaches to PageRank based search. The table below depicts these nine comparisons:
Method \ Weighing Scheme
NS
WS
CM
GPR
QTSPR
PTSPR
Thus, GPR will be implemented using three weighing schemes, NS, WS, and CM. Similarly for QTSPR and PTSPR, leading to nine total methods in your repertoire.
You will be reporting the following items for each of the compared approaches:
CiteEval dataset is a personalized search dataset. The relevance judgments for each user-query pair have been provided only by that particular user. This is unlike TREC relevance judgments where relevance judgments are pooled in by a group of users. For this part of the assignment, the students should closely analyze the personalized relevance judgments for queries from at least 4 users (20 queries). Select your users such that for half of them you believe the search will benefit from personalization. You should report your relevance assessments through the web-interface (this interface will be accessible after November 11, 2009). You should report: