11-741 - Information Retrieval
Jamie Callan
Yiming Yang |
Due: Mar. 26, 2009, 11:59 PM |
HOMEWORK 3
This assignment will give you experience implementing a collaborative filtering system for movie ratings. We will be using a subset of the data provided by Netflix for the Netflix Prize competition. For details and background of this dataset or the Netflix Prize competiton, see:
This assignment consists of three major parts:
- Write a program;
- Do some experiments with the program; and
- Write a report describing your program, the experiments,
and your conclusions.
Each part of the assignment will receive roughly equal credit during
grading. This places a lot of emphasis on the report,
so plan accordingly.
Dataset
The complete Netflix Prize dataset consists of over 100 million ratings from
~480 thousand users on ~17 thousand movies. It's a large dataset, requiring
about 2G uncompressed disk space. For this reason, we've created a sample of
the dataset for you to use on this assignment, including about 880,000
ratings. The Netflix
Prize subset is distributed in exactly the same format as the original
dataset, which should make it easy to try your algorithm on the full dataset
if you desire (although this is not required for the assignment).
IMPORTANT IMPORTANT IMPORTANT IMPORTANT:
We have gotten permission from Netflix to re-distribute this data for classroom purposes, and we need to adhere to the following guidelines:
- You MAY NOT distribute the Netflix Prize subset to anyone outside of the class
- You MUST remove the dataset from your computer when the class is over
- You MUST adhere to the guidelines in the README file included with the dataset
If you want to continue to work with the Netflix Prize data after class is over, you can sign up for an account at the Netflix Prize competition website and download the original dataset yourself.
Files to Download
- The "training" set -- Netflix Prize Subset.
- The "test" set -- a query file containing
movie-user pairs for which predictions need to be made.
- A smaller "test" set -- a smaller query file
, more amenable to faster implement-test-debug cycles.
For details on the file formats, see the README file included with the Netflix
Prize subset.
(Note that since we have created a subset of the original dataset, the file sizes and
checksums will not match with the ones mentioned in the README file)
The Program
Your program is responsible for doing the following:
- Reading in the data from the Netflix Prize subset
- Reading in a "query" file containing user-movie pairs
- Producing a prediction for the rating of each user-movie pair in the "query" file
- Write those predictions to a log file using the format specified below
You may want to pre-process the data to make it easier for your program to
perform certain functions. Something similar to the inverted document index
should immensely improve the scalability of your implementation.
For example, in the original data format
it is somewhat difficult to look up all the ratings by a given user. It is
perfectly OK to do some pre-processing on the dataset and probably a good idea
if you're planning on experimenting with a variety of rating algorithms. How
you preprocess the data is up to you. In the report, please document anything
you do to pre-process the dataset.
For more details on making a rating prediction, see the experiments section
The input to your program will be a "query" file containing movie-user pairs for you to make a prediction on. The format of this file is as follows:
MovieID1:
UserID11
UserID12
UserID13
...
MovieID2:
UserID21
UserID22
UserID23
...
Your program must read in this file and produce a prediction file of a similar format:
MovieID1:
prediction for UserID11
prediction for UserID12
prediction for UserID13
...
MovieID2:
prediction for UserID21
prediction for UserID22
prediction for UserID23
...
where the predictions are decimal values between 1.0 and 5.0. (Fractional ratings are fine). Note that it is critical that the order in this file match the input file.
The quality of predictions will be evaluated based on a metric typically used in collaborative filtering, Root Mean Squared Error (RMSE). RMSE is calculated as follows:
E = 0
For each rating
E = E + (true_rating - rating)^2
RMSE = sqrt(E / num_ratings)
Automatic Evaluation
Upload your output log file using this web form, which will calculate the RSME
for you. Make sure you use the correct form!
Corpus Exploration
In addition to running your collaborative filtering experiments, you must do some exploration of the dataset. In your report, include the following pieces of information:
- Provide the following corpus statistics:
- the total number of movies
- the total number of users
- the number of times any movie was rated '1'
- the number of times any movie was rated '3'
- the number of times any movie was rated '5'
- the average movie rating across all users and movies
- For user ID 1234576, provide the following:
- the number of movies rated
- the number of times the user gave a '1' rating
- the number of times the user gave a '3' rating
- the number of times the user gave a '5' rating
- the average movie rating for this user
- For movie ID 4321, provide the following:
- the number of users rating this movie
- the number of times the movie was rated '1'
- the number of times the movie was rated '3'
- the number of times the movie was rated '5'
- the average rating for this movie
You will be responsible for implementing several different rating-prediction
algorithms. These should all be based on the k-Nearest Neighbors (kNN)
algorithm, but similarity between users/movies will be calculated in several
different ways and the rating prediction will be made in several different
ways. A description of the general kNN algorithm for rating prediction is as
follows:
Input: a movie, user pair: p = (movieA, userA)
- Find the k nearest-neighbors of p using a similarity metric of your choice. These neighbors will be a list of movie, user, rating triples:
N={(m, u, r) | m = movie, u = user, r = rating in [1,5]}
- Produce a prediction of the rating for p based on the neighbors N
Output: a rating r in the range [1,5]
Your choices of how to implement steps (1) and (2) above are partially up to you. In class we talked about several different similarity metrics that could be used to find the nearest neighbors in step (1). You may not have talked about many ways to do (2), but here are several simple ideas:
- The mean rating for this movie among the neighbors, or
- The weighted mean rating for this movie among the neighbors, using the similarity measure from step (1) as the weight.
Additionally, it is often wise to consider that different movies are inherently more popular than others (i.e. have a higher average rating) and different users are more generous than others (i.e. give a higher average rating). If we would like to take this into account with our rating prediction step, we need to account for this movie- or user-bias. Just like Pearson's correlation coefficient takes into account the movie or user mean rating when calculating similarity between movies or users, we can normalize for this rating bias when making the prediciton. We are leaving the exact formulation of this normalization as an exercise for you in this assignment. Please include the details of your implementation in your report.
You must implement 4 specific variations on the above general algorithm. The variations on this algorithm that you must implement are:
- Find neighbors for step (1) by using a user-user similarity metric of your chouce. Use the mean or weighted mean rating for step (2).
- Find neighbors for step (1) by using a movie-movie similarity metric of your choice. Use the mean or weighted mean rating for step (2).
- One of the above for step (1), but apply user-rating and/or movie-rating normalization for step (2).
- A custom algorithm of your choice. You may extend one of the above algorithms or come up with your own. Please include the details of your implementation in your report.
You must conduct four tests with your program for each of the above rating algorithms, producing a ratings log file for each of the ratings algoritms described above. For each experiment you must report the "wall clock" running time and the RMSE as computed by the evaluation script.
kNN is a slow algorithm, and you will definitely notice that processing the
queries file is time consuming. There are a few tips we have added in the FAQ about making your code faster.
What to Turn In
You must turn in a written report, in txt, Word, or pdf format. Your
report must contain the following sections, each clearly labeled as
an independent section. Please include your name and Andrew ID
at the top of the first page of your report.
- Corpus Exploration: The results of your corpus exploration as described above.
- The three basic rating algorithms: A description of your three basic rating algorithms, including details on the similarity metric used, the normalizations performed, and anything interesting you encountered while implementing these.
- The custom rating algorithm: A detailed description of your custom rating algorithm, again including the similarity metric used, any normalizations, and anything interesting you encoundered while implementing it. Include your motivation behind the algorithm: what problem were you trying to solve with this algorithm? how were you trying to improve upon the basic algorithms? what worked (or didn't work) for this algorithm? can you explain why you saw improvement (or not) over the basic algorithms?
- Complete experimental results:
Discuss the complete set of experimental results, comparing the
algorithms to each other.
For each algorithm, include the RMSE and running time
results described above.
Discuss your observations about the
various algorithms, i.e., differences in how they performed,
what worked well and didn't, patterns/trends you observed across
the set of experiments, etc. This is your chance to show what you
learned from this homework assignment - take this section seriously.
NOTE: You must report the results on the full query
set. The smaller query set is only for your own convenience. See the FAQ
for more details.
- The software implementation & data preprocessing:
- a description of what you did to pre-process the dataset to make your implementations easier or more efficient.
- a description of your design decisions and high-level software architecture;
- a description of major data structures (if any);
- any programming tools or libraries that you used;
- strengths and weaknesses of your design, and
any problems that your system encountered
- Source Code:
Include your source code together with your report. The TA
will look at your source code, so make sure that it is legible,
has reasonable documentation, and can be understood by others.
This is a Computer Science class - we do actually care about
your source code.
Please make it easy for the TAs to see how you have addressed
each of the requirements described for each section.
Restrictions
- Your system must do this task entirely automatically.
No manual intervention is allowed.
- There is no restriction on the programming language used: C, C++, Java, Perl, Python, Ruby, Matlab, Mathematica or R are all fine choices. Do what makes this easiest for you. But, please remember to document your code.
- You must write all of the software yourself.
FAQ
If you have questions not answered here, see the
Frequently Asked Questions file. If your question is not answered
there, please contact the TA or the instructor.
Copyright 2009, Carnegie Mellon University.
Updated on Feb 25, 2009.
lezhao@cs