Simple Collaborative Filtering

· October 6, 2009

Collaborative filtering is a way of trying to present more relevant information to users, by choosing what to show based on how other users have acted. The “You might also like” boxes on Amazon and similar are probably the most popular example of this kind of technology, but it’s applicable in recommending all kind of content outside of ecommerce, particular anything that often has user ratings, such as video or games.

What the system needs is some way of allowing users to express preferences for certain items This could be making a purchase, clicking a link, or rating the item for example. We can score the similarity between pairs of items by using the Pearson correlation coefficient which gives a value to how reliably users’ scores change together. If the value for X increase as Y increases, so a user that rates X highly also rates Y highly, even if they don’t increase at the same rate, the score will be high towards 1. If one consistently decreases as the other increases the score will be negative, towards -1, and if there’s no relationship it will be zero.

This function generates the score for two lists, assumed to be in the form array(identifier => score). It will return a score between -1 and 1.

<?php
function pearson($item1ratings, $item2ratings) {
        $n = $sum1 = $sum2 = $sumSq1 = $sumSq2 = $product = 0;

        foreach($item1ratings as $user => $score) {
                if(!isset($item2ratings[$user])) {
                        continue;
                }
                
                $n++;
                $sum1 += $score;
                $sum2 += $item2ratings[$user];
                $sumSq1 += pow($score, 2);
                $sumSq2 += pow($item2ratings[$user], 2); 
                $product += $score * $item2ratings[$user];
        }

        // No shared ratings, so the similarity is 0
        // May want to tweak this to have a different minimum
        if($n == 0) {
                return 0;
        }

        // Work out the actual score
        $num = $product - (($sum1* $sum2)/$n);
        $den = sqrt(($sumSq1-pow($sum1, 2) / $n) * ($sumSq2 - pow($sum2, 2)/$n));

        if($den == 0) {
                return 0;
        }

        return $num/$den;
}
?>

We can use this by getting a list of all reviews and items and comparing every pair. This is an expensive operation, but doesn’t need to be done all that often - it’s certainly something that could be batched up overnight for example. What we will end up with is a matrix of scores for many pairs of products - in a production system this would be in a database. In our example data we have a range of scores for various items from different users, ranging from 1 to 5.

<?php
// Example data
$data = array(
        'item1' => array(1 => 5, 2 => 3, 3 => 5, 4 => 1),
        'item2' => array(1 => 1, 2 => 2, 3 => 3, 4 => 4),
        'item3' => array(2 => 4, 3 => 5, 4 => 2),
        'item4' => array(2 => 3, 3 => 5, 4 => 1, 5 => 1),
        'item5' => array(2 => 3, 3 => 5, 4 => 1)
);

function generateSimilarities($data) {
        $similarities = array();

        foreach($data as $item => $scores) {
                $similarities[$item] = array();

                foreach($data as $item2 => $scores2) {
                        if($item2 == $item || isset($similarities[$item][$item2])) {
                                continue;
                        }

                        $sim = pearson($scores, $scores2);
                        if($sim > 0) { // minimum similarity 
                                $similarities[$item][$item2] = $sim;
                                $similarities[$item2][$item] = $sim;
                        }
                }
                arsort($similarities[$item]);
        }
        return $similarities;
}
?>

We can then make recommendations based on what they’re looking at. If we don’t know anything about the user, we can just recommend similar items to the one they’re looking at by taking the top scores out of the array. This can be used to suggest items on the product page itself, for example, or to suggest other videos users might like to watch after watching one.

<?php
function getSimilar($item, $similarities) {
                $val = each($similarities[$item]);
                return $val['key'];
}
?>

However, if the user has expressed some preferences themselves, then we can use that to try and find something else they might liked. For example, if the user has rated two items, A, which she rated 5, and B, which she rated one, we can then work out what to suggest by looking at the similarity. If X is 0.4 similar to A, and 0.3 similar to B it gets a score of 0.4 * 5 + 0.3 * 1 = 2.3. We then divide by the total similarities to get the user’s estimated score: 2.3 / 0.7 = 3.3. If Y is 0.3 similar to A, but only 0.1 similar to B, it gets a score of 0.3 * 5 + 0.1 * 1 = 1.6, divided by 0.4 to get 4.

<?php
function recommend($user, $data, $similarities, $top = 5) {
        $out = array();
        $items = array();

        foreach($data as $item => $scores) {
                if(isset($scores[$user])) {
                        $items[$item] = $scores[$user];
                }
        }
        
        foreach($items as $item => $score) {
                foreach($similarities[$item] as $sim_item => $sim) {
                        if(isset($items[$sim_item])) {
                                continue; // they already rated this
                        }
                                
                        if(!isset($out[$sim_item])) {
                                $out[$sim_item] == 0;
                        }
                
                        $out[$sim_item] += $sim * $score;
                        $sims[$sim_item] += $sim;
                }
        }
        
        foreach($out as $item => $score) {
                $out[$item] = $score / $sims[$item];
        }

        arsort($out);
        return array_slice($out, 0, $top); // return top n
}
?>

There are plenty of extensions possible once the data has been calculated, such as finding similar users for last.fm style neighbour suggestions. The Pearson coefficient is hardly the only similarity score as well - particular for the boolean preferences (‘buy’ or ‘click’ for example) a set overlap like the Jacquard coefficient would do the job just as well, and doesn’t involve any square roots. At a more advanced level clustering techniques could be used to provide even better recommendations, though at the cost of more complexity.