In an earlier post we looked at a simple search system that could handle straightforward boolean combinations of words in a query. Much of the time we can treat even ‘natural’ searches like that, assuming that a search like php information retrieval is “look for any document containing the words php AND information AND retrieval”, but sometimes the user is searching for that specific phrase in that specific order.
Most search engines let you specify phrase searches by wrapping the query in quote marks: “php information retrieval”. Often this can make a query that would return a lot of noise, like to be or not to be into one that is quite targeted “to be or not to be”.
To implement phrase searches we need to store a bit more data on the posting list side of the system. Previously we were storing lists as a mapping of a term to each document that contained it, like docId1, docId2… We now want to add some extra data to each posting, by storing the position of every occurrence of the term, usually as a character offset. Our new posting list might look something like this:
The above would indicate that the term occurs three times in document 1, once 12 characters into the document, once 253 characters in, and finally 334 characters in. To actually implement the search we break the query down as usual, extracting the terms, but then we ensure that each term is within a certain number of characters from the previous one, to allow for some extra characters like doubled spaces or different punctuation being used.
This happens at the point where the posting lists of the terms are merged, which means we can’t use the standard php array_* functions as we did before. However, we don’t have to worry about the boolean operators, as including them in phrase searches themselves doesn’t make too much sense, so we only have to implement the intersection part.
Conveniently, PHP provides an array_uintersect function that allows us to specify our own comparison method. We retrieve all documents that contain both terms A and B, where B is within a certain number of characters of A. The function assumes both position lists are sorted.
The function compared the document ids to ensure we’re only matching within a given document. For ones that match we loop through the position list looking for entries within CHARACTER_TOLERANCE of each other, and return a 0 to indicate to the array_uintersect function that they’re equivalent.
The extra metadata adds substantially to our index though, which makes it more and more likely that there will be difficulty in building it in memory, but that’s an issue for another post!