My friend Vincenzo recently posted a review of academic work on clustering that he compiled while working at the University of Naples. It’s worth a look if you’re interested in the field, going from the basic methods all the way up to the latest techniques like Support Vector Clustering (which I believe you can read about in Enzo’s master’s thesis).

Clustering, for those who haven’t encountered it, refers to partitioning some data into a number of groups, or clusters. Unlike classification, it’s an unsupervised technique: there are no labelled examples from which to learn which group each item should go into. The ideal algorithms even attempt to work out how many groups there should be, but for a lot of the simpler techniques the number of clusters is an input parameter.

This can be useful for all sorts of reasons, but one particular example Vincenzo gives is guiding searches. A search for ‘tiger’ might return the regular page of results, but also offer refinements for Tiger Woods, the tiger the animal, or the version of Mac OS X, allowing searchers to focus their queries on one particular grouping of results.

Probably the simplest method is the K-Means algorithm, which places k points in the data space, then repeatedly moves each one to the average of the items closest to it. This is a flat clustering technique - there’s no relationship between the clusters, as opposed to (more advanced) hierarchical clustering algorithms. Below I’ve knocked up a brief implementation (download the whole source on GitHub), though for the example, rather than using a whole document vector (a series of values that represents a document based on the words it contains) as you would for a real search engine, we have a set of simple two-dimensional points.

Our first step is to randomly initialise the points around which we will cluster, referred to as the centroids. We try to stay inside the bounds of the data, as we know our points are going to be at the center of a circle, so we limit the rand function with the min and max. The function takes the data to cluster, formatted as above (though with any number of entries), and a value k to represent the number of clusters we want.
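To make the initialisation step concrete, here is a minimal sketch in Python rather than the original listing. The function name `initialise_centroids`, the optional `rng` parameter, and the sample points are all made up for illustration; the only idea taken from the text is that each random coordinate is limited to the min and max of the data in that dimension.

```python
import random


def initialise_centroids(data, k, rng=random):
    """Pick k random starting centroids inside the bounding box of the data.

    `data` is a list of points, each a list of numbers of the same length.
    `rng` is an assumption added for repeatability: anything with a
    `uniform(a, b)` method works, e.g. `random.Random(42)`.
    """
    dimensions = len(data[0])
    # Smallest and largest value seen in each dimension.
    lows = [min(point[d] for point in data) for d in range(dimensions)]
    highs = [max(point[d] for point in data) for d in range(dimensions)]
    # Each centroid gets one random coordinate per dimension, within bounds.
    return [
        [rng.uniform(lows[d], highs[d]) for d in range(dimensions)]
        for _ in range(k)
    ]


# Made-up sample points, purely for illustration.
data = [[1, 1], [2, 1], [8, 9], [9, 8]]
centroids = initialise_centroids(data, 2)
```

Passing a seeded `random.Random` instance as `rng` gives the same starting positions on every run, which is handy when comparing results.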

Once we have our starting points we can then look at the main meat of the algorithm, the loop of iteratively moving and testing the centroids to find those that fit the data.
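As a rough illustration of that loop, here is a Python sketch rather than the original code. The name `k_means`, the argument order, and the `max_iterations` safety cap are assumptions of mine; the assignment and update steps are passed in as functions so that the snippet stands on its own.

```python
def k_means(data, centroids, assign, update, max_iterations=100):
    """Alternate assignment and update steps until the clustering settles.

    `assign(data, centroids)` should return a list giving, for each point,
    the index of the centroid it belongs to.
    `update(data, mapping, centroids)` should return the moved centroids.
    `max_iterations` is an assumed safety cap, not part of the basic method.
    """
    mapping = None
    for _ in range(max_iterations):
        new_mapping = assign(data, centroids)
        # Stopping condition: no point changed cluster since the last pass.
        if new_mapping == mapping:
            break
        mapping = new_mapping
        centroids = update(data, mapping, centroids)
    return centroids, mapping
```

The loop returns both the final centroids and the point-to-cluster mapping, since you usually want the groups as much as the centres.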

The two new functions referenced in the main loop are assignCentroids and updateCentroids. The first just takes the current centroids and maps each data point to the closest one, using the absolute distance between them. This is also our stopping condition - if no data points swap clusters in an iteration, we assume we’re done and exit.
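The assignment step might look something like this in Python. It is a sketch, not the original function: `assign_centroids` mirrors the name in the text, but I am assuming the distance is ordinary straight-line (Euclidean) distance and that the result is a plain list of centroid indexes.

```python
def assign_centroids(data, centroids):
    """Return, for each point in `data`, the index of its nearest centroid."""
    mapping = []
    for point in data:
        # Squared Euclidean distance (an assumed metric); skipping the
        # square root does not change which centroid is closest.
        distances = [
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        ]
        mapping.append(distances.index(min(distances)))
    return mapping


# Made-up points and centroids, purely for illustration.
print(assign_centroids([[1, 1], [2, 1], [8, 9]], [[1.5, 1.0], [8.0, 9.0]]))
# prints [0, 0, 1]
```

Comparing this list with the one from the previous iteration is enough to tell whether any point has swapped clusters.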

The second function, updateCentroids, just moves the centroid to the average of all the points that are ‘assigned’ to it. If we end up with an unassigned centroid, we simply generate some new random coordinates for it so it goes back into the pool.
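Here is a Python sketch of the update step, again an illustration rather than the original code. The signature of `update_centroids` and the `rng` parameter are assumptions; an unassigned centroid is re-seeded at a random position inside the bounds of the data, which is my reading of ‘new random coordinates’.

```python
import random


def update_centroids(data, mapping, centroids, rng=random):
    """Move each centroid to the mean of the points assigned to it.

    `mapping[i]` is the index of the centroid that `data[i]` belongs to.
    A centroid with no points is given fresh random coordinates inside
    the bounding box of the data (an assumed re-seeding strategy).
    """
    dimensions = len(data[0])
    updated = []
    for index in range(len(centroids)):
        members = [p for p, m in zip(data, mapping) if m == index]
        if members:
            # Average the assigned points one dimension at a time.
            updated.append(
                [sum(p[d] for p in members) / len(members) for d in range(dimensions)]
            )
        else:
            # Unassigned centroid: send it back into the pool at random.
            updated.append(
                [
                    rng.uniform(min(p[d] for p in data), max(p[d] for p in data))
                    for d in range(dimensions)
                ]
            )
    return updated
```

For example, with points `[[1, 1], [3, 1], [8, 9]]` and mapping `[0, 0, 1]`, the first centroid moves to the midpoint of the first two points and the second lands on the third point.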

Running the whole thing gives us the following output (or a var_dump on it does anyway):

The centroids, and the natural clustering of the data, are a bit easier to see on a chart:

You can see in the output array that the data has clustered as we’d expect it to, but that is in part because of the random starting positions that were chosen. If they had been slightly different we might have ended up with a different partitioning of the data, one which may be less optimal. If we were being clever we could use a mix of random and intentional points - place one at random, one as far away from it as possible, the next as far away as possible from that one, and so on. This can be a bit of an effort if you’ve got a lot of dimensions, but will likely result in a more reliable algorithm.
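One way to sketch that smarter seeding in Python is below. Note that it is a slight variation on the description: each new centroid is the data point farthest from all of the centroids chosen so far (often called farthest-first traversal), not just from the previous one, which stops the picks bouncing between the same two extremes. The name `farthest_first_centroids`, the `rng` parameter, and the choice to seed from actual data points are my assumptions.

```python
import random


def farthest_first_centroids(data, k, rng=random):
    """Choose k starting centroids from the data points themselves.

    The first is picked at random; each later one is the point whose
    distance to its nearest already-chosen centroid is largest.
    """
    centroids = [list(rng.choice(data))]
    while len(centroids) < k:

        def gap(point):
            # Squared distance from `point` to its closest chosen centroid.
            return min(
                sum((p - c) ** 2 for p, c in zip(point, centroid))
                for centroid in centroids
            )

        centroids.append(list(max(data, key=gap)))
    return centroids


# Made-up sample points, purely for illustration.
data = [[1, 1], [2, 1], [8, 9], [9, 8], [1, 9]]
starting_points = farthest_first_centroids(data, 3)
```

Each pick scans every point against every chosen centroid, so the cost grows with both the size of the data and the number of dimensions, which is the ‘bit of an effort’ mentioned above.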