Text Classification (And Twitter)

Classification techniques are used for spam filters, author identification, intrusion detection and a host of other applications. They can be used to help organise data into a structure, or to add tags to allow users to find documents. While the latest classification algorithms are at the cutting edge of machine learning, there are still thousands of systems using simpler algorithms to great effect.

Tries And Wildcards

One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren’t sure whether the title you’re trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.
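
As a rough illustration of how a trailing wildcard like academ* can be answered, here is a minimal trie sketch. It is written in Python for brevity, and the names Trie, insert and with_prefix are made up for this example rather than taken from the post. It assumes the wildcard only ever appears at the end of the term.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a complete term ends at this node


class Trie:
    """Hypothetical minimal trie supporting trailing-wildcard lookups."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def with_prefix(self, prefix):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every complete term below that node.
        results = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for term in ["academy", "academic", "academically", "academics", "acacia"]:
    trie.insert(term)

matches = trie.with_prefix("academ")  # the query academ* with the * removed
```

Each matched term would then be looked up in the index as usual and the results combined with OR.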

Simple Search: Phrases

In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even ‘natural’ searches like that, assuming that a search like php information retrieval is “look for any document containing the words php AND information AND retrieval”, but sometimes the user is searching for that specific phrase in that specific order.
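
One common way to support phrase queries is a positional index, which records where each term occurs as well as which documents contain it. The sketch below is in Python and is only an assumption about how this might be done, not the implementation from the post. The function names and the whitespace tokenisation are purely illustrative.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc id -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the terms adjacent and in order."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Boolean AND first: only documents containing every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in candidates:
        # The phrase matches if term i appears exactly i places after the start.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[term][doc_id]
                   for i, term in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    1: "php information retrieval",
    2: "information about retrieval in php",
    3: "notes on php information retrieval systems",
}
index = build_positional_index(docs)
result = phrase_search(index, "php information retrieval")
```

Document 2 contains all three words, so it passes the boolean AND step, but it is dropped because the words are not adjacent in that order.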

Simple Search: The Vector Space Model

One of the issues with the boolean search model is that results are unranked - every matching document for a query contains all of the terms in that query, and there’s no real way of saying which are ‘better’. However, if we could weight the terms in a document based on how representative they were of the document as a whole, we could order our results by the ones that were the best match for the query. This is the idea that forms the basis for the vector space model.
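
To make the weighting idea concrete, here is a small Python sketch that uses tf-idf weights and cosine similarity, one widely used variant of the vector space model. The exact weighting scheme is an assumption, since the excerpt does not say which one is used, and the function names are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build a sparse tf-idf vector (term -> weight) for each document."""
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    doc_count = len(tokenised)
    doc_freq = Counter()
    for tokens in tokenised.values():
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {term: math.log(doc_count / df) for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: count * idf[term]
                 for term, count in Counter(tokens).items()}
        for doc_id, tokens in tokenised.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(docs, query):
    """Return ids of matching documents, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    query_vector = {term: count * idf.get(term, 0.0)
                    for term, count in Counter(query.lower().split()).items()}
    scores = {doc_id: cosine(query_vector, vec)
              for doc_id, vec in vectors.items()}
    matching = [doc_id for doc_id, score in scores.items() if score > 0]
    return sorted(matching, key=lambda doc_id: scores[doc_id], reverse=True)


docs = {
    "a": "php search search engine",
    "b": "php cooking recipes",
    "c": "search tips",
}
ranking = rank(docs, "search")
```

Document "a" ranks above "c" here because "search" accounts for more of its overall weight, which is the sense of "representative" described above.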

Taking a string and separating it into tokens is one of those smaller problems in search that seems initially simple - split on spaces - but can quickly become overwhelmed with edge cases. Ignoring the problem of other languages, some of which don’t even necessarily use a space, the exceptions tend to fall into two categories: punctuation and normalisation.
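
As a small illustration of both categories, the Python sketch below normalises case and accents, then keeps apostrophes and hyphens only when they sit inside a word. The specific rules are assumptions chosen for the example. A real tokeniser would need to decide these cases for its own collection.

```python
import re
import unicodedata


def tokenise(text):
    """Hypothetical tokeniser: normalise first, then split on punctuation."""
    # Normalisation: fold case and strip accents (e.g. "Café" -> "cafe").
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Treat the curly apostrophe the same as the straight one.
    text = text.replace("\u2019", "'")
    # Punctuation: keep apostrophes and hyphens only between word characters,
    # so "don't" and "e-mail" survive but trailing commas and dashes do not.
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text)


tokens = tokenise("Don\u2019t split e-mail, Café!")
```

Whether "e-mail" should stay as one token, or become "email" or "e" and "mail", is exactly the kind of edge case the paragraph above refers to.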

Block Based External Sort

Memory isn’t something that we have to worry about very much in PHP, as memory management is handled for us by the Zend engine. However, when it does become an issue it becomes a very big one - most PHP scripts are limited as to how much memory they can consume. While this makes a lot of sense for web processes, and is in general not a problem, when you have a lot of data to deal with it can make life difficult.
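
The general shape of a block based external sort can be sketched as follows: sort fixed-size blocks in memory, write each to a temporary file, then merge the sorted files. This is a Python sketch under stated assumptions, not the PHP code from the post. The name external_sort and the block_size parameter are illustrative, and input lines are assumed to contain no newline characters.

```python
import heapq
import os
import tempfile


def read_run(path):
    """Stream one sorted run back from disk, a line at a time."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def external_sort(lines, block_size):
    """Yield lines in sorted order, sorting at most block_size lines at once."""
    run_paths = []
    block = []

    def flush():
        # Sort the current block in memory and write it out as a run file.
        block.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(line + "\n" for line in block)
        run_paths.append(path)
        block.clear()

    for line in lines:
        block.append(line)
        if len(block) >= block_size:
            flush()
    if block:
        flush()

    try:
        # heapq.merge keeps only the current line of each run in memory.
        yield from heapq.merge(*(read_run(path) for path in run_paths))
    finally:
        for path in run_paths:
            os.remove(path)


result = list(external_sort(["pear", "apple", "fig", "kiwi", "banana"], 2))
```

Peak memory is governed by the block size plus one line per run during the merge, rather than by the size of the whole data set.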

How To Use Your Business Cards

I got some new business cards from work the other day, and they came in the box direct from the printer, which along with the usual ad for themselves included an instruction manual for the cards. Admittedly much of the advice involves giving out as many business cards as possible, something they might be expected to encourage, but there were a couple I wouldn't have guessed. Some choice examples:

Insert A Business Card When Paying Bills. Bills contain advertisements. Why can't you advertise your skills or services the same way? Insert a business card with you payment. You may not think anyone who opens your credit card bill payment can help you - NEVER underestimate the power of networking. As the movies 6 degrees of separation points out we are six people away from knowing someone of influence. You could be six people away from knowing the Prime Minister or The Queen.

Ask For Referrals. When giving a business card people will feel more comfortable if you say "If you know anyone that could use my services, give them my card" [...] This always places you in a better position - they will feel better about helping you and you should give them at least two cards.

Use Proper Business Card Etiquette. When you give a business card, ask for a business card. When given a business card don't just take it and put it in your wallet. Make the person feel important by looking at their card for a few seconds - you may see something that could be a topic of discussion. Write comments on the card such as date, location and common points of interest. These comments will prove valuable when following-up with that person. This also demonstrates a sincere interest in the other person. Only then should you place the card in your wallet.

Simple Search: Boolean Retrieval

If you asked most people how a search engine worked, their answer would likely be a far cry from the acres of servers and vast collections that Google queries millions of times a day. That said, the intuitive view of a search engine is in many ways just a series of incremental steps away from Mountain View.

I am a fan of the truly odd Radio 1 DJ and Pimp My Ride UK presenter Tim Westwood, and this has only been enhanced by the wonderful glimpse of his life you get from his Twitter. I've collected some of my favourite moments:

Needed to wash the curtains - they've all shrunk by a foot! Now they don't even cover the windows! That's fucked up! - TimWestwood

Got in late last nite. Nicked a bottle of milk from the building next store. The top came off & spilt all over my best jacket. Bloody karma! - TimWestwood

I don't understand why some shops are still shut today - I guess its another kebab from edgware road - same place I ate Christmas Day - TimWestwood

And this wonderful sequence of tweets about taking the office to Nando's:

I've asked everyone at the office to have a late lunch - cos there's no sense goin to Nandos super-hungry cos I'm payin - TimWestwood

just brought some bars of chocolate – goin to give them to everyone just we before we go Nando’s - TimWestwood

ha - got hot cross buns! Gonna serve them at 7pm - by time we get to nando's everyone will only be ordering 1/4 chickens & no desert - TimWestwood

Those hot cross buns went down a treat & now some chocolate for the non Christians. Also we goin to share one bottomless coke glass - TimWestwood

One of the team asked if they could bring their girl tonite - not on my bloody tab! How ghetto is that! - TimWestwood

this better not cost more £35 in total for the 7 of us - was hoping a couple of them would cancel! - TimWestwood

Mad love for the big dog.

