In this lab, you will write a PHP script that will make predictions using a user-based collaborative filter.
We'll be using some data from the MovieLens project. (Our thanks to them for making data available.) We have 80,000 ratings (on a scale of 1-5), collected in 1997 and 1998, from 943 users for 1682 movies. Users who rated fewer than 20 movies have been removed. (We have an additional 20,000 ratings that we will be using in a later lab.)
I have stored the ratings (and some other information) in a MySQL database called cs6120_db1 and I have granted you read access.
You also need a copy of the following PHP script that I have written for you: cs6120_cf.php. Make sure you save it as cs6120_cf.php, not as cs6120_cf.phps.
Now you're ready to write your own script. It should begin like this:
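<?php
// Load the class that I have written for you
include('cs6120_cf.php');
// Connect to the ratings database
$cf = new cs6120_cf('localhost', 'userid', 'password', 'cs6120_db1');
// Give up if the connection failed
if (! $cf->is_connected)
{
    die("Couldn't connect to database");
}
?>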
Replace userid by your user id and password by your MySQL password (not your normal login password).
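This fragment of PHP includes my script, creates a cs6120_cf object that connects to the cs6120_db1 database, and stops with an error message if the connection failed.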
You will predict user 1's rating for movie 12.
First you need to obtain user 1's nearest neighbours. To be more precise, you need neighbours who have rated item 12, since neighbours who haven't rated item 12 cannot help you to make the prediction. Don't worry! This is easy because it is one of the functions I have written for you...
Suppose you want to get 20 such neighbours. Then you use the following:
$knn = $cf->get_k_nearest_users(20, 1, 12);
This places the neighbours into an array called $knn.
I suggest you temporarily output this array, so you can see what it looks like:
print_r($knn);
Read the description in the Appendix entitled cf->get_k_nearest_users($k, $a_id, $i_id) to understand what's in the array.
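To give a feel for the shape of the result, here is a small sketch of a two-neighbour array with the keys described in the Appendix. The user ids, similarities, means and ratings are invented for illustration; they are not taken from the database, and your real output will differ.

```php
<?php
// Hypothetical data: these values are made up to show the array's shape only.
$knn = array(
    array('u_id' => 101, 'a_u_sim' => 0.82, 'a_mean' => 3.9, 'u_mean' => 3.4, 'u_i_rating' => 5),
    array('u_id' => 202, 'a_u_sim' => 0.75, 'a_mean' => 4.1, 'u_mean' => 3.0, 'u_i_rating' => 4),
);

// Each neighbour is an associative array, so its fields are read by key.
foreach ($knn as $neighbour) {
    echo "User {$neighbour['u_id']} rated the item {$neighbour['u_i_rating']}\n";
}
```

Reading fields by key like this is all you need in order to use the array in the rest of the lab.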
On slide 20 in Lecture 4, I gave you three formulae for making predictions. For now, use the first one, which is also the simplest. According to this formula, the prediction is simply the average of the neighbours' ratings for item 12. You can use a foreach loop to compute this prediction.
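The averaging step might look something like the sketch below. It runs on a small hard-coded array rather than on the result of get_k_nearest_users, so the neighbours and their ratings are invented and the prediction it prints is not the one you should expect for user 1 and item 12.

```php
<?php
// Hypothetical neighbours: only the key used by the first formula is shown.
$knn = array(
    array('u_id' => 101, 'u_i_rating' => 5),
    array('u_id' => 202, 'u_i_rating' => 4),
    array('u_id' => 303, 'u_i_rating' => 3),
);

// Add up the neighbours' ratings for the item, then divide by how many there are.
$sum = 0;
foreach ($knn as $neighbour) {
    $sum += $neighbour['u_i_rating'];
}
$prediction = $sum / count($knn);

echo "Predicted rating: $prediction\n";
```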
For your information, user 1's actual rating for this item is 5. Use print or echo to output the actual rating (5), your predicted rating (which, if your script is correct, will be 4.6), and the absolute error.
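As a sketch of the output step, the fragment below computes the absolute error from the two values quoted above. In your own script $prediction would come from your foreach loop; here it is simply set to 4.6 so that the fragment runs on its own.

```php
<?php
// Values quoted in this lab sheet for user 1 and movie 12.
$actual = 5;
$prediction = 4.6;

// The absolute error ignores whether the prediction was too high or too low.
$abs_error = abs($actual - $prediction);

// Round for display, because floating-point subtraction is not exact.
echo "Actual: $actual, predicted: $prediction, absolute error: " . round($abs_error, 2) . "\n";
```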
If all is well, now modify your script to predict user 1's rating for movie 74. This time the actual rating is 1 and, if your script is correct, your prediction will be 3.25.
And now modify your script again to predict user 7's rating for movie 599. Unless you wrote your script very carefully, you may now have a problem. Fix it.
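One thing worth checking, and this is only one possible cause of trouble, is what your script does when the array of neighbours comes back empty, since the Appendix says it may contain fewer than $k entries. The sketch below shows one way of handling that case. The name predict_mean is a hypothetical helper of my own choosing, not part of cs6120_cf.php, and returning null is just one possible design.

```php
<?php
// predict_mean is a hypothetical helper name, not part of cs6120_cf.php.
// It returns the mean of the neighbours' ratings, or null if there are none.
function predict_mean(array $knn)
{
    if (count($knn) === 0) {
        return null; // no neighbours, so no average can be computed
    }
    $sum = 0;
    foreach ($knn as $neighbour) {
        $sum += $neighbour['u_i_rating'];
    }
    return $sum / count($knn);
}

// Simulate a call that found no suitable neighbours.
$prediction = predict_mean(array());
if ($prediction === null) {
    echo "No prediction possible: no neighbours have rated this item\n";
} else {
    echo "Predicted rating: $prediction\n";
}
```

Returning null rather than 0 keeps "no prediction" distinct from any value on the 1-5 rating scale.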
Bearing in mind that evaluating predictions and recommendations will form the basis of the CS6120 assignment, you might like to try some or all of the following either now or in your own time:
Here is a list of the functions that I have written for your use. I assume that your script begins with the fragment of PHP that I showed you at the start of this lab sheet.
cf->get_k_nearest_users($k, $a_id, $i_id)
Returns the $k nearest neighbours of user $a_id who have rated item $i_id.
The result is an array containing the nearest neighbours, in no particular order.
The length of this array will be no more than $k and may be less than $k if there are insufficient users who both have items in common with $a_id and have rated $i_id.
Each neighbour in the array is represented as an associative array, whose keys are as follows:
u_id: the neighbour's user id
a_u_sim: the Pearson correlation of users $a_id and $u_id (i.e. their degree of similarity)
a_mean: user $a_id's mean rating for the items s/he has in common with user $u_id
u_mean: similarly for user $u_id
u_i_rating: user $u_id's rating for item $i_id
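Because the neighbours come back in no particular order, you may want to sort them yourself, for example when inspecting the output. Here is a minimal sketch that orders a hard-coded array by a_u_sim, most similar first. The ids and similarities are invented, and the anonymous function assumes PHP 5.3 or later.

```php
<?php
// Hypothetical neighbours: the ids and similarities are invented.
$knn = array(
    array('u_id' => 101, 'a_u_sim' => 0.41),
    array('u_id' => 202, 'a_u_sim' => 0.93),
    array('u_id' => 303, 'a_u_sim' => 0.67),
);

// Order the neighbours from most similar to least similar.
usort($knn, function ($x, $y) {
    if ($x['a_u_sim'] == $y['a_u_sim']) {
        return 0;
    }
    return ($x['a_u_sim'] > $y['a_u_sim']) ? -1 : 1;
});

foreach ($knn as $neighbour) {
    echo $neighbour['u_id'] . ": " . $neighbour['a_u_sim'] . "\n";
}
```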
cf->get_k_nearest_users($k, $a_id)
Returns the $k nearest neighbours of user $a_id.
The result is an array containing the nearest neighbours, in no particular order.
The length of this array will be no more than $k and may be less than $k if there are insufficient users who have items in common with $a_id.
Each neighbour in the array is represented as an associative array, whose keys are as follows:
u_id: the neighbour's user id
a_u_sim: the Pearson correlation of users $a_id and $u_id (i.e. their degree of similarity)
a_mean: user $a_id's mean rating for the items s/he has in common with user $u_id
u_mean: similarly for user $u_id
cf->get_thresholded_nearest_users($threshold, $a_id, $i_id)
Returns all users whose degree of similarity to user $a_id exceeds $threshold and who have rated item $i_id.
The result is an array containing the nearest neighbours, in no particular order.
Each neighbour in the array is represented as an associative array, whose keys are as follows:
u_id: the neighbour's user id
a_u_sim: the Pearson correlation of users $a_id and $u_id (i.e. their degree of similarity)
a_mean: user $a_id's mean rating for the items s/he has in common with user $u_id
u_mean: similarly for user $u_id
u_i_rating: user $u_id's rating for item $i_id
cf->get_thresholded_nearest_users($threshold, $a_id)
Returns all users whose degree of similarity to user $a_id exceeds $threshold.
The result is an array containing the nearest neighbours, in no particular order.
Each neighbour in the array is represented as an associative array, whose keys are as follows:
u_id: the neighbour's user id
a_u_sim: the Pearson correlation of users $a_id and $u_id (i.e. their degree of similarity)
a_mean: user $a_id's mean rating for the items s/he has in common with user $u_id
u_mean: similarly for user $u_id
cf->get_k_thresholded_nearest_users($k, $threshold, $a_id, $i_id)
Returns the $k nearest neighbours of user $a_id provided their similarity to $a_id exceeds $threshold and provided they have rated item $i_id.
The result is an array containing the nearest neighbours, in no particular order.
The length of this array will be no more than $k and may be less than $k if there are insufficient users who have items in common with $a_id, have sufficient similarity to $a_id and have rated $i_id.
Each neighbour in the array is represented as an associative array, whose keys are as follows:
u_id: the neighbour's user id
a_u_sim: the Pearson correlation of users $a_id and $u_id (i.e. their degree of similarity)
a_mean: user $a_id's mean rating for the items s/he has in common with user $u_id
u_mean: similarly for user $u_id
u_i_rating: user $u_id's rating for item $i_id
cf->get_k_thresholded_nearest_users($k, $threshold, $a_id)
Returns the $k nearest neighbours of user $a_id provided their similarity to $a_id exceeds $threshold.
The result is an array containing the nearest neighbours, in no particular order.
The length of this array will be no more than $k and may be less than $k if there are insufficient users who have items in common with $a_id and have sufficient similarity to $a_id.
Each neighbour in the array is represented as an associative array, whose keys are as follows:
u_id: the neighbour's user id
a_u_sim: the Pearson correlation of users $a_id and $u_id (i.e. their degree of similarity)
a_mean: user $a_id's mean rating for the items s/he has in common with user $u_id
u_mean: similarly for user $u_id
cf->get_user_num_ratings($user_id)
Returns the number of items that user $user_id has rated.
cf->get_user_average_rating($user_id)
Returns the average of user $user_id's ratings.
cf->get_user_ratings($user_id)
Returns user $user_id's ratings.
The result is an array. Each item in the array is itself an associative array, whose keys are as follows:
item_id: the id of the item
rating: the user's rating for the item
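If you need to look up ratings by item id, it can be convenient to turn this result into a lookup table first. The sketch below does so for a hard-coded array shaped like the result described above. The two entries reuse the ratings quoted earlier for user 1 (items 12 and 74), and $rating_for is just a variable name I have chosen.

```php
<?php
// Stand-in for the result of get_user_ratings: same shape, hard-coded values.
$ratings = array(
    array('item_id' => 12, 'rating' => 5),
    array('item_id' => 74, 'rating' => 1),
);

// Build a table that maps each item id to the user's rating for it.
$rating_for = array();
foreach ($ratings as $r) {
    $rating_for[$r['item_id']] = $r['rating'];
}

// isset tells us whether the user has rated a given item at all.
if (isset($rating_for[74])) {
    echo "Rating for item 74: " . $rating_for[74] . "\n";
}
```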
cf->get_item_num_ratings($item_id)
Returns the number of users who have rated item $item_id.
cf->get_item_average_rating($item_id)
Returns the average of item $item_id's ratings.
cf->get_item_ratings($item_id)
Returns item $item_id's ratings.
The result is an array. Each item in the array is itself an associative array, whose keys are as follows:
user_id: the id of the user
rating: the user's rating for the item
cf->get_rating($user_id, $item_id)
Returns user $user_id's rating for item $item_id, or false if the user has not rated the item.
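When a function can return either a value or false, it is safest to test the result with the strict comparison ===. The sketch below illustrates this with lookup_rating, a hypothetical stand-in that I have written over an in-memory array so that the fragment runs without the database. It mimics the behaviour described above but is not the real get_rating.

```php
<?php
// lookup_rating is a hypothetical stand-in that mimics the documented
// behaviour of get_rating: a rating, or false when there is none.
function lookup_rating(array $ratings, $user_id, $item_id)
{
    if (isset($ratings[$user_id][$item_id])) {
        return $ratings[$user_id][$item_id];
    }
    return false;
}

// Ratings keyed by user id and then item id (user 1's ratings quoted earlier).
$ratings = array(1 => array(12 => 5, 74 => 1));

$rating = lookup_rating($ratings, 1, 99);
if ($rating === false) {
    echo "User 1 has not rated item 99\n";
} else {
    echo "User 1 rated item 99 as $rating\n";
}
```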