CS6120 Lab 04

Preparation

In this lab, you will write a PHP script that will make predictions using a user-based collaborative filter.

We'll be using some data from the MovieLens project. (Our thanks to them for making data available.) We have 80,000 ratings (on a scale of 1-5), collected in 1997 and 1998, from 943 users for 1682 movies. Users who rated fewer than 20 movies have been removed. (We have an additional 20,000 ratings that we will be using in a later lab.)

I have stored the ratings (and some other information) in a MySQL database called cs6120_db1 and I have granted you read access.

You also need a copy of the following PHP script that I have written for you: cs6120_cf.php. Make sure you save it as cs6120_cf.php, not as cs6120_cf.phps.

Now you're ready to write your own script. It should begin like this:

<?php
    include('cs6120_cf.php');
    $cf = new cs6120_cf(
        'localhost', 'userid', 'password', 'cs6120_db1');
    if (! $cf->is_connected)
    {
        die("Couldn't connect to database");
    }

?>

Replace userid with your user id and password with your MySQL password (not your normal login password).

This fragment of PHP does the following:

It loads in the PHP script that I provided. My script makes available to you a set of useful functions, which are explained in the Appendix of this lab sheet.
It then makes a connection to the cs6120_db1 database using your user id and your password.
Finally, it checks that the connection is successfully made. If your script fails at this point, check that you have used the correct user id and password.

You will predict user 1's rating for movie 12.

Getting the neighbours

First you need to obtain user 1's nearest neighbours. To be more precise, you need neighbours who have rated item 12, since neighbours who haven't rated item 12 cannot help you to make the prediction. Don't worry! This is easy because it is one of the functions I have written for you...

Suppose you want to get 20 such neighbours. Then you use the following:

 
 $knn = $cf->get_k_nearest_users(20, 1, 12);

This places the neighbours into an array called $knn. I suggest you temporarily output this array, so you can see what it looks like:

 print_r($knn);

Read the description in the Appendix entitled cf->get_k_nearest_users($k, $a_id, $i_id) to understand what's in the array.

Making the prediction

On slide 20 in Lecture 4, I gave you three formulae for making predictions. For now, use the first one, which is also the simplest. According to this formula, the prediction is simply the average of the neighbours' ratings for item 12. You can use a foreach loop to compute this prediction.
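As a rough illustration of the averaging step, here is a minimal sketch. It assumes you have already pulled the neighbours' ratings out of $knn into a flat array; the function name mean_prediction and the sample ratings are hypothetical, and how you extract the ratings depends on the array layout described in the Appendix.

```php
<?php
// Sketch only: $ratings is a hypothetical flat array of neighbours' ratings.
// Extracting those ratings from $knn depends on the layout of that array,
// which is described in the Appendix.
function mean_prediction(array $ratings)
{
    $total = 0;
    $count = 0;
    foreach ($ratings as $rating) {
        $total += $rating;
        $count++;
    }
    // With no ratings there is nothing to average, so return null
    // instead of dividing by zero.
    if ($count === 0) {
        return null;
    }
    return $total / $count;
}

// Made-up ratings, purely for illustration (not taken from the database).
echo mean_prediction([4, 5, 3, 4]), "\n"; // prints 4
```

Returning null for an empty array is just one possible design choice; the caller then has to decide what to report when no prediction can be made.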

Testing your script

For your information, user 1's actual rating for this item is 5. Use print or echo to output the actual rating (5), your predicted rating (which, if your script is correct, will be 4.6), and the absolute error.

If all is well, now modify your script to predict user 1's rating for movie 74. This time the actual rating is 1 and, if your script is correct, your prediction will be 3.25.
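The absolute error is just the size of the gap between the actual and the predicted rating. The sketch below works this out for the two test cases quoted above; the helper name absolute_error is hypothetical, and the predicted values are the ones this lab sheet says a correct script should give, not values computed here.

```php
<?php
// Hypothetical helper: the absolute error of a single prediction.
function absolute_error(float $actual, float $predicted): float
{
    return abs($actual - $predicted);
}

// The two test cases from this lab sheet: (actual rating, expected prediction).
$cases = [[5, 4.6], [1, 3.25]];
foreach ($cases as $case) {
    list($actual, $predicted) = $case;
    // %.2f hides floating-point noise in results such as 5 - 4.6.
    printf("actual %d, predicted %.2f, absolute error %.2f\n",
        $actual, $predicted, absolute_error($actual, $predicted));
}
```

So the absolute errors for the two cases are 0.40 and 2.25. printf is used here only to control the number of decimal places; print or echo would work just as well.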

And now modify your script again to predict user 7's rating for movie 599. Unless you wrote your script very carefully, you may now have a problem. Fix it.

Next steps

Bearing in mind that evaluating predictions and recommendations will form the basis of the CS6120 assignment, you might like to try some or all of the following either now or in your own time:

Experiment with different values for k.
Extend your script to compute predictions using the other two formulae on slide 20 of Lecture 4.
Extend your script to make recommendations (rather than predictions): ways to do this were mentioned in the lecture (slide 22).
Read the Appendix and see if you can come up with other ways of making personalized, and even non-personalized, predictions and recommendations.
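If you want to try the first suggestion, one possible way to organise the experiment is sketched below. Everything in it is an assumption made for illustration: errors_by_k is a hypothetical helper, and $fetch_ratings stands in for whatever code of yours calls $cf->get_k_nearest_users and turns the result into a flat array of ratings. The stub uses made-up ratings so that the sketch runs without the database.

```php
<?php
// Hypothetical helper: for each value of k, predict with the simple average
// and record the absolute error against the known actual rating.
function errors_by_k(callable $fetch_ratings, array $ks, float $actual): array
{
    $errors = [];
    foreach ($ks as $k) {
        $ratings = $fetch_ratings($k);
        if (count($ratings) === 0) {
            // No neighbours, so no prediction and no error to report.
            $errors[$k] = null;
            continue;
        }
        $prediction = array_sum($ratings) / count($ratings);
        $errors[$k] = abs($actual - $prediction);
    }
    return $errors;
}

// Stub with made-up ratings; in your script this would query the database.
$pool = [5, 4, 3, 2];
$stub = function ($k) use ($pool) {
    return array_slice($pool, 0, $k);
};

print_r(errors_by_k($stub, [1, 2, 4], 5));
```

With the made-up ratings above, the errors for k = 1, 2 and 4 are 0, 0.5 and 1.5 respectively. Real data will of course behave differently.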

Appendix

Here is a list of the functions that I have written for your use. I assume that your script begins with the fragment of PHP that I showed you at the start of this lab sheet.

cf->get_k_nearest_users($k, $a_id, $i_id)

Returns the $k nearest neighbours of user $a_id who have rated item $i_id.

The result is an array containing the nearest neighbours, in no particular order. The length of this array will be no more than $k and may be less than $k if there are insufficient users who both have items in common with $a_id and have rated $i_id.
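Because the array may contain fewer than $k neighbours, it can be worth checking its length before you use it. The sketch below is one way to do that; describe_neighbour_count is a hypothetical helper, not one of the functions in cs6120_cf.php. It relies only on count(), so it makes no assumption about what each element of the array looks like.

```php
<?php
// Hypothetical helper: reports how many neighbours came back relative to $k.
// It only calls count(), so the layout of each element does not matter.
function describe_neighbour_count(array $knn, int $k): string
{
    $found = count($knn);
    if ($found === 0) {
        return "no neighbours found";
    }
    if ($found < $k) {
        return "only $found of $k neighbours found";
    }
    return "all $k neighbours found";
}

// Placeholder elements, purely for illustration.
echo describe_neighbour_count(['a', 'b'], 20), "\n";
```

In your own script you would pass the array returned by $cf->get_k_nearest_users together with the value of $k that you asked for.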