15-493 - Information Retrieval and Web Mining                                                                                                                                          Due Date: Part 1: November 11, 2009, Part 2: November 19, 2009
Jamie Callan
Yiming Yang


Assignment 3: Personalized PageRank



In this assignment, you will be developing the PageRank algorithm and its personalized variants and compare them empirically on the provided CiteEval dataset.

 

This assignment consists of 3 major sections:

  1. Implementation: Write a program to compute PageRank, Personalized PageRank, and Query-sensitive PageRank for the documents in the CiteEval dataset
  2. Retrieval: Perform retrieval for the provided user-query pairs
  3. Relevance Assessment: Assess the relevance judgments in the dataset


NOTE: This assignment is to be submitted in two parts. The first part, consisting of items A and B above, is due on November 9, 2009. The second part, consisting of item C above, is due on November 16, 2009. Allocate your time accordingly.

 

Both parts will require submission of the corresponding report. It is important to do a good job on your report, so please leave enough time to write the report for each part. Details for each of the deliverables follow.

 

A) Implementation

You will be implementing


1) Global PageRank (GPR) described on slide 16

2) Query-based Topic Sensitive PageRank (QTSPR) as described on slide 6

3) Peronalized Topic Sensitive PageRank (PTSPR). This is a personalized variant of 2) above. Instead of weighing the topic-sensitive pagerank for each topic with the query-topic distribution, we weigh it with the user's interest in that particular topic. i.e. P(j|q) is replaced by P(j|u).


The transition matrix of the CiteEval documents can be downloaded here. This document is in the sparse matrix format. i.e. each row denotes information for a cell. e.g. a row of the form "i j k" denotes that the matrix contains a value k at the row i column j.  The value k=1 denotes that there is a link from document i to document j. 


For the TSPR-based algorithms, you will also need the document classification information. This can be downloaded here. In this file, each row of the form "x y" denotes that document x belongs to the y category. 


For personalization, you will need each user's topical interest distribution P(j|u) for all topics. This can be downloaded here. Each row is of the form: 

"a b 1:x 2:y ........ 12:z". a represents the user, b represents the query by the user, and the items of the form 1:x denote that the P(1 | a) = x. In other words, each component in this line is p(j|u) for that user


For QTSPR, you will need each query's topical distribution P(j|q) for all topics. This can be downloaded here. Each row is of the form: 

"a b 1:x 2:y ........ 12:z". a represents the user, b represents the query by the user, and the items of the form 1:x denote that the P(1 | b) = x. In other words, each component in this line is p(j|q) for that query


You may use any matrix multiplication library suitable for your programming language of choice or implement your own matrix multiplication routine. 

 

Needless to say, you must strive to modularize your code to share the common components of the various PageRank algorithms.That will significantly reduce the programming effort.

 

B) Retrieval

For retrieval, you will be implementing variants of the Google's approach to combination of PageRank and search-relevance scores. (slide 21). One simple function is the weighted sum of the PageRank and search-relevance scores. Let's call this WS for weighted sum. In addition to this function, you should also implement a custom way of combining PageRank and search-relevance scores. Let's call this CM for custom method. Additionally, you may refrain from using a combination of scores, and just use the PageRank scores to rank the documents. Let's call this NS for No-search as query-specific search results are not included.


The search-relevance scores of each query can be downloaded here. This zipped archive contains several files, (one per query)  each containing the ranked-list returned by the Indri search engine for a particular query, with the corresponding retrieval scores. The document format is similar to the trec_eval format that you have used in the first two assignments.



Based on the provided dataset, you will be comparing nine approaches to PageRank based search. The table below depicts these nine comparisons:


Method  \   Weighing Scheme
NS
WS
CM
GPR



QTSPR



PTSPR





Thus, GPR will be implemented using three weighing schemes, NS, WS, and CM. Similarly for QTSPR and PTSPR, leading to nine total methods in your repertoire.

 

 

You will be reporting the following items for each of the compared approaches:

  1. MAP (Mean Average Precision) averaged over all the queries
  2. Precision at 11 standard recall levels (0%, 10%, …, 100% recall levels) averaged over all the queries
  3. Wall-clock running time in seconds (This should include PageRank computation time + retrieval time) averaged over all the queries.
  4. Values of chosen dampening factor, weighing factor (for combining scores) or any other parameters in your system

You will be using the trec_eval program to evaluate your system. The desired format is same as that of HW1 and HW2.

Additionally, you will also be analyzing the results, and stating your general observations about the various parameters in the system (iv above)

 

Submission Guidelines for sections  A and B: (due date: 11 November, 2009)

Your submission must include the following items:
1) Source code for your implementation of GPR, QTSPR, PTSPR, and each of the different weighing scheme
2) Details of Software implementation:
    1. Describe the software architecture of your implementation at a high-level.
    2. Describe any data structures you used for speeding up the computation of PageRank.
    3. Describes strengths and weaknesses of your implementation
3) Explanation of the custom weighing scheme that you have implemented
4) The empirical comparison of the 9 approaches, based on the various metrics described in section B
5) A analysis of the various algorithms, parameters, and general observations about using PageRank algorithms

Zip your source code and the report into a single archive. It should be named yourandrewid-hw3-part1.zip and submitted through the appropriate link on blackboard.


C) Relevance Assessment

CiteEval dataset is a personalized search dataset. The relevance judgments for each user-query pair have been provided only by that particular user. This is unlike TREC relevance judgments where relevance judgments are pooled in by a group of users. For this part of the assignment, the students should closely analyze the personalized relevance judgments for queries from at least 4 users (20 queries). Select your users such that for half of them you believe the search will benefit from personalization. You should report your relevance assessments through the web-interface (this interface will be accessible after November 11, 2009). You should report:

  1. Missing relevance judgments: In a personalized search environment, it is possible that documents relevant to one person may not be relevant to another. On similar lines, you will find that some documents that are relevant according to you for a query are missing from the provided relevance judgments.  You should identify such documents from the interface by clicking on the “Missed” button next to each document. Each missed document is a Miss, in your opinion, for that query.
  2. Excess relevance judgments: Similar to part i), you may not agree with the annotator about a particular relevance judgment. This means, a document relevant to the user may not be relevant according to you. You should identify such documents on the interface by clicking on the “Excess” button next to each document. Each excess document is a False Alarms, in your opinion, for that query.
  3. In your written text report, you should include the following table

The SIGIR 2008 paper To Personalize or Not to Personalize: Modeling Queries with Variation in User Intent by Jaime Teevan, Susan T. Dumais and Daniel J. Liebling, describes click-entropy as a possible means of identifying whether a query can benefit from personalization or not. A Click here refers to the personalized bias of each user in clicking documents relevant to her in the search results. The paper describes a large scale user-study to support the hypothesis that if each user clicks on different search results for the same query (high click entropy) then that query can benefit from personalized search. In this assignment, high number of Misses and Excesses in parts i) and ii) denotes a disagreement between you and the original annotator. Do your observations support the claim presented in the paper? If so, how? If not, why not?  You must describe your analysis in the report.


Submission Guidelines for section C: (due date: 19 November, 2009)


1) Submit your relevance assessments through the web-interface.
2) In your report include the table described in C.iii above.
3) In your report, describe your observations in general:
    a) The queries that you tried, did they seem to benefit from personalization?
    b) For the queries that you believed should benefit from personalization, did they actually benefit (as seen from  your experimental results)
4) What could be some novel ways for search engines to estimate whether a query can benefit from personalization?
5) What could be some novel ways of identifying the user's interests (e.g. the user's topical interest distribution P(j|u) ) in general?