11-741 - Information Retrieval
Jamie Callan
Yiming Yang
Due: Mar. 26, 2009, 11:59 PM

HOMEWORK 3

This assignment will give you experience implementing a collaborative filtering system for movie ratings. We will be using a subset of the data provided by Netflix for the Netflix Prize competition. For details and background of this dataset or the Netflix Prize competiton, see:

This assignment consists of three major parts:

Each part of the assignment will receive roughly equal credit during grading. This places a lot of emphasis on the report, so plan accordingly.

Dataset

The complete Netflix Prize dataset consists of over 100 million ratings from ~480 thousand users on ~17 thousand movies. It's a large dataset, requiring about 2G uncompressed disk space. For this reason, we've created a sample of the dataset for you to use on this assignment, including about 880,000 ratings. The Netflix Prize subset is distributed in exactly the same format as the original dataset, which should make it easy to try your algorithm on the full dataset if you desire (although this is not required for the assignment).

IMPORTANT IMPORTANT IMPORTANT IMPORTANT:
We have gotten permission from Netflix to re-distribute this data for classroom purposes, and we need to adhere to the following guidelines:

If you want to continue to work with the Netflix Prize data after class is over, you can sign up for an account at the Netflix Prize competition website and download the original dataset yourself.

Files to Download

For details on the file formats, see the README file included with the Netflix Prize subset.
(Note that since we have created a subset of the original dataset, the file sizes and checksums will not match with the ones mentioned in the README file)

The Program

Your program is responsible for doing the following:

  1. Reading in the data from the Netflix Prize subset
  2. Reading in a "query" file containing user-movie pairs
  3. Producing a prediction for the rating of each user-movie pair in the "query" file
  4. Write those predictions to a log file using the format specified below
You may want to pre-process the data to make it easier for your program to perform certain functions. Something similar to the inverted document index should immensely improve the scalability of your implementation. For example, in the original data format it is somewhat difficult to look up all the ratings by a given user. It is perfectly OK to do some pre-processing on the dataset and probably a good idea if you're planning on experimenting with a variety of rating algorithms. How you preprocess the data is up to you. In the report, please document anything you do to pre-process the dataset.

For more details on making a rating prediction, see the experiments section

Evaluating Results

The input to your program will be a "query" file containing movie-user pairs for you to make a prediction on. The format of this file is as follows:

MovieID1:
UserID11
UserID12
UserID13
...
MovieID2:
UserID21
UserID22
UserID23
...
Your program must read in this file and produce a prediction file of a similar format:
MovieID1:
prediction for UserID11
prediction for UserID12
prediction for UserID13
...
MovieID2:
prediction for UserID21
prediction for UserID22
prediction for UserID23
...
where the predictions are decimal values between 1.0 and 5.0. (Fractional ratings are fine). Note that it is critical that the order in this file match the input file.

The quality of predictions will be evaluated based on a metric typically used in collaborative filtering, Root Mean Squared Error (RMSE). RMSE is calculated as follows:

E = 0
For each rating
	E = E + (true_rating - rating)^2
RMSE = sqrt(E / num_ratings)

Automatic Evaluation

Upload your output log file using this web form, which will calculate the RSME for you. Make sure you use the correct form!

Corpus Exploration

In addition to running your collaborative filtering experiments, you must do some exploration of the dataset. In your report, include the following pieces of information:

  1. Provide the following corpus statistics:
    1. the total number of movies
    2. the total number of users
    3. the number of times any movie was rated '1'
    4. the number of times any movie was rated '3'
    5. the number of times any movie was rated '5'
    6. the average movie rating across all users and movies
  2. For user ID 1234576, provide the following:
    1. the number of movies rated
    2. the number of times the user gave a '1' rating
    3. the number of times the user gave a '3' rating
    4. the number of times the user gave a '5' rating
    5. the average movie rating for this user
  3. For movie ID 4321, provide the following:
    1. the number of users rating this movie
    2. the number of times the movie was rated '1'
    3. the number of times the movie was rated '3'
    4. the number of times the movie was rated '5'
    5. the average rating for this movie

The Experiments

You will be responsible for implementing several different rating-prediction algorithms. These should all be based on the k-Nearest Neighbors (kNN) algorithm, but similarity between users/movies will be calculated in several different ways and the rating prediction will be made in several different ways. A description of the general kNN algorithm for rating prediction is as follows:

Input: a movie, user pair: p = (movieA, userA)

  1. Find the k nearest-neighbors of p using a similarity metric of your choice. These neighbors will be a list of movie, user, rating triples:
    N={(m, u, r) | m = movie, u = user, r = rating in [1,5]}
  2. Produce a prediction of the rating for p based on the neighbors N
Output: a rating r in the range [1,5]

Your choices of how to implement steps (1) and (2) above are partially up to you. In class we talked about several different similarity metrics that could be used to find the nearest neighbors in step (1). You may not have talked about many ways to do (2), but here are several simple ideas:

Additionally, it is often wise to consider that different movies are inherently more popular than others (i.e. have a higher average rating) and different users are more generous than others (i.e. give a higher average rating). If we would like to take this into account with our rating prediction step, we need to account for this movie- or user-bias. Just like Pearson's correlation coefficient takes into account the movie or user mean rating when calculating similarity between movies or users, we can normalize for this rating bias when making the prediciton. We are leaving the exact formulation of this normalization as an exercise for you in this assignment. Please include the details of your implementation in your report.

You must implement 4 specific variations on the above general algorithm. The variations on this algorithm that you must implement are:

  1. Find neighbors for step (1) by using a user-user similarity metric of your chouce. Use the mean or weighted mean rating for step (2).
  2. Find neighbors for step (1) by using a movie-movie similarity metric of your choice. Use the mean or weighted mean rating for step (2).
  3. One of the above for step (1), but apply user-rating and/or movie-rating normalization for step (2).
  4. A custom algorithm of your choice. You may extend one of the above algorithms or come up with your own. Please include the details of your implementation in your report.
You must conduct four tests with your program for each of the above rating algorithms, producing a ratings log file for each of the ratings algoritms described above. For each experiment you must report the "wall clock" running time and the RMSE as computed by the evaluation script.

kNN is a slow algorithm, and you will definitely notice that processing the queries file is time consuming. There are a few tips we have added in the FAQ about making your code faster.

What to Turn In

You must turn in a written report, in txt, Word, or pdf format. Your report must contain the following sections, each clearly labeled as an independent section. Please include your name and Andrew ID at the top of the first page of your report.

  1. Corpus Exploration: The results of your corpus exploration as described above.
  2. The three basic rating algorithms: A description of your three basic rating algorithms, including details on the similarity metric used, the normalizations performed, and anything interesting you encountered while implementing these.
  3. The custom rating algorithm: A detailed description of your custom rating algorithm, again including the similarity metric used, any normalizations, and anything interesting you encoundered while implementing it. Include your motivation behind the algorithm: what problem were you trying to solve with this algorithm? how were you trying to improve upon the basic algorithms? what worked (or didn't work) for this algorithm? can you explain why you saw improvement (or not) over the basic algorithms?
  4. Complete experimental results: Discuss the complete set of experimental results, comparing the algorithms to each other. For each algorithm, include the RMSE and running time results described above. Discuss your observations about the various algorithms, i.e., differences in how they performed, what worked well and didn't, patterns/trends you observed across the set of experiments, etc. This is your chance to show what you learned from this homework assignment - take this section seriously.
    NOTE: You must report the results on the full query set. The smaller query set is only for your own convenience. See the FAQ for more details.
  5. The software implementation & data preprocessing:
    1. a description of what you did to pre-process the dataset to make your implementations easier or more efficient.
    2. a description of your design decisions and high-level software architecture;
    3. a description of major data structures (if any);
    4. any programming tools or libraries that you used;
    5. strengths and weaknesses of your design, and any problems that your system encountered
  6. Source Code: Include your source code together with your report. The TA will look at your source code, so make sure that it is legible, has reasonable documentation, and can be understood by others. This is a Computer Science class - we do actually care about your source code.
Please make it easy for the TAs to see how you have addressed each of the requirements described for each section.

Restrictions

  1. Your system must do this task entirely automatically. No manual intervention is allowed.
  2. There is no restriction on the programming language used: C, C++, Java, Perl, Python, Ruby, Matlab, Mathematica or R are all fine choices. Do what makes this easiest for you. But, please remember to document your code.
  3. You must write all of the software yourself.

FAQ

If you have questions not answered here, see the Frequently Asked Questions file. If your question is not answered there, please contact the TA or the instructor.


Copyright 2009, Carnegie Mellon University.
Updated on Feb 25, 2009.
lezhao@cs