One nice bit of search query functionality, particularly in boolean systems, is the wildcard match. If you aren’t sure whether the title you’re trying to remember contains the word academy, academic, academically, or academics then you might be well served by trying all four: academ*.
We can treat this problem as a kind of query expansion - as if the user had performed a query for all words that begin with ‘academ’, and we would get the union of the posting lists for each matching word as normal. To do this, we need to be able to efficiently retrieve all words that match the wildcard, which requires a change to the dictionary the terms are stored in.
Earlier we used a hash map (in the form of a normal PHP array) to store the dictionary. This works fine for quick lookups, but doesn’t allow us to find similar words, as even words that vary only by an accent, such as cliche and cliché will generate very different hashes.
Instead, we will use a tree structure that allows us to look for a word a letter at a time, called a trie, from retrieval - though usually pronounced ‘try’. A trie is structured so there is a node for each letter in a word. A trie containing the words ‘acts’, ‘actor’ and ‘ant’ would look like:
a / \ c n | | t t / \ s o | r
We can store a value with each word, to allow us to hold the document count and lookup for the posting list. General retrieval can be handled by just walking through the tree one node at a time, and the string one character at a time looking for matches at each level. If we get to the end of the string, we can then test whether that’s a word node.
Traditionally the different letters in a node would have been represented with a linked list, and the existence of a word determined by adding an extra non-used character, like $, to distinguish between the word ‘add’ and the letters a-d-d as part of the word ‘adder’. However, we can stick the whole lot in a series of nested php arrays, which while not as efficient are rather straightforward! This also lets us have a ‘value’ key on the node that represents the end of a word.
Prefix matching, with the wildcard at the end of the query term, can be handled by looking up the specified characters, e.g. looking for the node in the trie that represents ‘academ’ (if it exists), then traversing through all nodes below it, returning all the words found. Here’s a class that handles this, which uses references to traverse the tree, though recursion or an iterator would probably be a bit cleaner!
Using this for a wildcard at the end of the string is then straightforward:
The next trick we can use is to compute another dictionary, but this time in reverse. We store each term reversed, so the path to the term academic would be c-i-m-e-d-a-c-a. Using the same technique as before we can now search for all words that end with a certain phrase, allowing us to handle wildcards at the beginning of query terms, e.g. *cademically.
By maintaining both we can handle wildcards in the middle of terms, by simply splitting the term into one that ends with a wildcard, and one that begins with a wildcard. The two resulting expansions can then by AND’d together, to give the proper results. For example hap*ily would become the query hap* AND *ily, and be treated as above. If we wrap up the logic in a couple of functions, we can process a single wildcard expansion anywhere in the string in a fairly straightforward manner:
The results from the two var dumps look like this:
There is another technique, the permuterm index, that can allow these kind of queries to be expanded within a single tree lookup, which works by adding a character to represent the end of the string, then rotating that string through each position. This can be quicker, but rather space intensive. Another technique is to create a k-gram index, which will store sequences of letters - for example a the 2-grams for ‘example$’ (with dollar denoting the end of the string would be): ‘ex’, ‘xa’, ‘am’, ‘mp’, ‘pl’, ‘le’, ‘e$’, which can be queried similarly.
The biggest weakness of the trie is the amount of space it wastes - chances are there will be runs of characters with a word only at the end, meaning a bunch of extra levels of the tree that aren’t needed. The PATRICIA trie, or radix trie, attempts to address this by allowing the ‘decision’ at each note to be on something other than the first character, so each node stores an offset and the branch options (e.g. if the letter in position 5 is A, go down this branch).
For real world systems, this technique is likely to be combined with a B-Tree into an String B-Tree, which is structured to allow the dictionary to remain mostly on disk, and minimise the number of accesses needed to answer any given query - important when the size of the dictionary is greater than can be practically stored in memory. That said, the basic trie is still a quite useful, and simple, data structure.