Benford's Law

Benford's Law is not an exciting new John Nettles-based detective show, but an interesting observation about the distribution of the first digit in sets of numbers originating from various processes. It says, roughly, that in a big collection of data you should expect to see a number starting with 1 about 30% of the time, but starting with 9 only about 5% of the time. Precisely, the proportion for a given leading digit d can be worked out as log10(1 + 1/d).
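As a quick illustrative sketch (not the post's own code, which is behind the link), that formula is easy to evaluate for each leading digit:

```python
import math

def benford_proportion(d):
    """Expected proportion of numbers whose first digit is d, per Benford's Law."""
    return math.log10(1 + 1 / d)

# Print the expected distribution for digits 1-9
for d in range(1, 10):
    print(d, round(benford_proportion(d), 3))
```

Digit 1 comes out at about 0.301 and digit 9 at about 0.046, matching the rough figures above.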

Read More

Monte Carlo Simulations

Monte Carlo simulations are a handy tool for looking at situations that have some aspect of uncertainty, by modelling them with a pseudo-random element and conducting a large number of trials. There isn’t a hard and fast Monte Carlo algorithm, but the process generally goes: start with a situation you wish to model, write a program to describe it that includes a random input, run that program many times, and look at the results.
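The classic toy example of that process is estimating pi: model a quarter circle inside the unit square, feed in random points, and count how many land inside. A minimal sketch (illustrative only, not the post's code):

```python
import random

def estimate_pi(trials, seed=0):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1:
            inside += 1
    return 4 * inside / trials
```

The more trials you run, the closer the estimate tends to get — which is the general shape of any Monte Carlo experiment.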

Read More

Bayesian Opinion Mining

The web is a great place for people to express their opinions, on just about any subject. Even the professionally opinionated, like movie reviewers, have blogs where the public can comment and respond with what they think, and there are a number of sites that deal in nothing more than this. The ability to automatically extract people's opinions from all this raw text can be a very powerful one, and it's a well-studied area - no doubt because of the commercial possibilities.
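To give a flavour of the Bayesian angle, here is a tiny naive Bayes sentiment classifier with add-one smoothing. The training examples are made up for illustration, and this is only a sketch of the general approach, not the post's own implementation:

```python
import math
from collections import Counter, defaultdict

# Toy labelled examples -- purely illustrative, not real review data.
TRAIN = [
    ("pos", "great film loved the acting"),
    ("pos", "wonderful plot great fun"),
    ("neg", "terrible film boring plot"),
    ("neg", "awful acting hated it"),
]

def train(examples):
    """Count words per label, plus label frequencies and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(text, model):
    """Pick the label with the highest log posterior for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + sum of smoothed log likelihoods
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best
```

Working in log space avoids underflow from multiplying many small probabilities together.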

Read More

PageRank In PHP

Google was a better search engine than its predecessors for a number of reasons, but probably the best known is PageRank, the algorithm for measuring the importance of a page based on what links to it. Though not necessarily that useful on its own, this kind of link analysis can be very helpful as part of a general information retrieval system, or when looking at any kind of network, such as a friend graph from a social network.
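The core of the algorithm can be sketched as power iteration over a link graph. This is a hedged illustration of the standard formulation (with the usual 0.85 damping factor), not the post's PHP code:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PageRank for a graph given as
    {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # a page shares its rank equally among its outlinks
                share = ranks[page] / len(outlinks)
                for target in outlinks:
                    new[target] += damping * share
            else:
                # dangling page: spread its rank across every page
                for target in pages:
                    new[target] += damping * ranks[page] / n
        ranks = new
    return ranks
```

On a small graph such as `{"a": ["b", "c"], "b": ["c"], "c": ["a"]}`, the page with the most incoming links ends up with the highest rank, and the ranks always sum to one.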

Read More

Text Generation

After a rather technical post last week, something a bit lighter. Text and language generation is a fun topic, with applications that run from randomly generating scientific papers for conferences to the practical tasks of generating speech and automated responses. In this post we'll look at how we can generate some nonsense text based on existing documents, which isn't especially practical, though it can make a fun change from Lorem Ipsum for holding copy. The code is throughout, but you can also grab the lot in a zip.
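One common way to generate nonsense text from existing documents is a word-level Markov chain: record which words follow each short word sequence, then walk the chain randomly. This is a generic sketch of that idea, not the post's own code:

```python
import random

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def generate(chain, length=20, seed=0):
    """Walk the chain from a random starting key, emitting up to `length` words."""
    rng = random.Random(seed)
    key = rng.choice(list(chain))
    out = list(key)
    for _ in range(length - len(key)):
        followers = chain.get(tuple(out[-len(key):]))
        if not followers:
            break  # dead end: no recorded continuation
        out.append(rng.choice(followers))
    return " ".join(out)
```

Higher orders produce text that reads more like the source but repeats it more; lower orders produce fresher, stranger output.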

Read More

Support Vector Machines In PHP

When it comes to classification, and machine learning in general, there's often a Support Vector Machine-based method at the head of the pack. In this post we'll look at what SVMs do and how they work, and as usual there's some example code. However, even a simple PHP-only SVM implementation is a little bit long, so this time the complete source is available separately in a zip file.

Read More

Part Of Speech Tagging

Until now, all the posts here have looked at text in a purely statistical way. What the words actually were was less important than how common they were, and whether they occurred in a query or a category. There are plenty of applications, however, where a deeper parsing of the text could be hugely beneficial, and the first step in such parsing is often part of speech tagging.
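The simplest possible tagger just remembers the most frequent tag for each word in a hand-tagged corpus — a common baseline before anything more sophisticated. The tiny corpus here is invented for illustration, and this is only a sketch of the idea, not the post's method:

```python
from collections import Counter, defaultdict

# A tiny hand-tagged corpus -- purely illustrative.
TAGGED = [
    ("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
    ("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB"),
]

def train_unigram_tagger(tagged):
    """For each word, remember its most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, model, default="NOUN"):
    # Unknown words fall back to the largest open class, a common baseline.
    return [(w, model.get(w, default)) for w in sentence.split()]
```

Real taggers also use context — the tags of neighbouring words — since many words ("run", "can") are ambiguous on their own.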

Read More

Language Detection With N-Grams

So far when we’ve been looking at text we’ve been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.
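Language detection is the classic application of letter n-grams: build a ranked profile of the most common n-grams per language, then compare a document's profile against each using a rank-distance measure. This sketch follows the well-known Cavnar and Trenkle "out-of-place" scheme, with invented sample sentences — it is not the post's own implementation:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most common character n-grams in the text."""
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, other):
    """Sum of rank differences between two profiles; missing n-grams
    get a fixed penalty."""
    penalty = len(other)
    return sum(abs(i - other.index(g)) if g in other else penalty
               for i, g in enumerate(profile))

def detect(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice the language profiles are built from much larger training text, but even tiny samples show the mechanics.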

Read More

Alternative Term Weighting

The term weighting and ranking function is at the core of any information retrieval system. The vector space model with cosine similarity is maybe the best known and most widely used, but there are plenty of alternatives. We'll look at two here: the BM25 function, based around a probabilistic model, and a function based around language modelling.
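As a taste of the first of those, here is the standard BM25 scoring function over a small corpus of tokenised documents, with the usual k1 and b defaults. This is a generic sketch of the formula, not the post's own code:

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """BM25 score of one document for a query, given the whole corpus
    (each document a list of terms) for document frequencies and lengths."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        # idf variant that stays positive for very common terms
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        # term frequency saturates via k1; b controls length normalisation
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

The k1 parameter caps how much repeated occurrences of a term can help, and b trades off how strongly long documents are penalised — two knobs the plain vector space model doesn't offer.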

Read More