I was recently working with a client where I was evaluating an implementation, and some of the members of the team who inherited a solution were concerned about an implementation of wildcarding. It turns out the implementation was correct – but the concerns that the client had were common. So let’s take a look at what it takes to do search wildcarding Front-to-Back and Back-to-Front.
What is Wildcarding?
Before we get too far, it’s important to explain what I mean by wildcarding. The short answer is that we’re looking for the pattern of characters provided anywhere in the text being searched. In most cases, when we’re doing this searching, we’re doing it from the start of a word or words – because, in truth, our brains work this way; but occasionally there are times when it might make sense to search for the characters beginning anywhere in a word. Algorithmically, this is a more difficult challenge to solve. As a result, most search engines support wildcarding only front-to-back instead of anywhere in a string.
To understand the algorithmic problem, it’s helpful to take a simplified look at what SQL has to do to solve the wildcarding problem in a single field.
SQL has supported both forward and backward wildcarding through the LIKE keyword for some time, so it’s common to assume it should “just work” in search as well. However, some kinds of wildcard operations in SQL are operationally very expensive. Let’s assume we’ve got a database table named Books with a field named Title. If I don’t have any indexes on the Title field and I use Title in the WHERE clause of a SQL statement, SQL will perform a full table scan. Operationally, full table scans are expensive, and we work hard to prevent them in SQL. We do this by adding indexes.
If we add an index to the Title field, we get an ordered list of titles. With this information, if we’re searching for a specific title, we can start in the middle of the index and jump forwards or backwards (continuing to narrow in quickly on the right place in the index) until we find the specific title we’re looking for. The index entry contains a row identifier into the main data table, so we can quickly read out the rest of the data we need from the main table.
Simplifying out some optimizations that SQL does: if we have a table of 100 records, without the index, SQL has to read all 100 records to find the title we’re looking for (and ensure there are no other matches). With an index on Title and a specific query, the maximum number of reads would be 14. Breaking this down, we can find any record in the list by repeatedly bifurcating it. 128 records is 2^7, so at most seven reads of the index cover our 100 records. If we don’t have all the fields we need in the index, then we have to go back to the main table to get the actual record – so another seven reads (maximum). These are worst-case numbers, and in many ways I’ve really over-simplified the impacts of caching, paging, row identifiers, etc., but the fundamentals are there so we can get a sense for the power of indexing.
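To make the arithmetic concrete, here’s a small sketch (Python, with a made-up list of titles) that counts the reads a binary search of an ordered index actually performs; with 100 entries the worst case is seven:

```python
def binary_search_reads(sorted_list, target):
    """Binary-search an ordered index, counting how many entries we read."""
    lo, hi, reads = 0, len(sorted_list) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        reads += 1  # one index read per probe
        if sorted_list[mid] == target:
            return mid, reads
        if sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, reads  # not found

# A hypothetical ordered index of 100 titles.
titles = sorted(f"Title {i:03d}" for i in range(100))

# Worst case over every title: at most 7 reads, since 2**7 = 128 >= 100.
worst = max(binary_search_reads(titles, t)[1] for t in titles)
print(worst)  # 7
```

Each probe halves the remaining range, which is where the logarithmic read count comes from.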
This improvement, which gets larger as the data set gets larger, relies on the ability to order the results in the index. This in turn means that we have to at least know what the first characters are so we can look up the rest. That’s the rub. To get the efficiencies in looking up data we have to order it, and we can’t order it if we don’t know the start.
So what happens when you provide a wildcard at the end of the string in SQL? Nothing special. It still uses the index and just walks across all the rows that could match. When there’s a wildcard at the beginning of the LIKE value, however, SQL gives up and does a full table scan – unless there’s a covering index.
Sidebar: A covering index is one that contains all of the fields needed to satisfy the query. Even if the index’s order can’t be used, it will sometimes be used instead of the data table, because less data would be read, making the scan somewhat more efficient. In our example, SQL would use our index on the Title field presuming we only asked for the Title field; it might use it if we asked for additional fields. Either way, there’s still a full scan of the data we’re interested in happening somewhere.
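You can watch SQL make exactly this choice. The sketch below uses Python’s built-in sqlite3 module (the table and index names are my own invention): with a trailing wildcard, SQLite’s EXPLAIN QUERY PLAN reports a seek into the Title index, but with a leading wildcard it falls back to a scan – here, a scan of the covering index.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA case_sensitive_like = ON")  # lets SQLite optimize LIKE
con.execute("CREATE TABLE Books (Id INTEGER PRIMARY KEY, Title TEXT)")
con.execute("CREATE INDEX IX_Books_Title ON Books (Title)")
con.executemany("INSERT INTO Books (Title) VALUES (?)",
                [("SQL Basics",), ("Search Engines",), ("Searching at Scale",)])

def plan(query):
    """Return the query-plan detail lines for a statement."""
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + query)]

# Trailing wildcard: SQLite can seek into the ordered Title index.
print(plan("SELECT Title FROM Books WHERE Title LIKE 'Sea%'"))
# Leading wildcard: SQLite falls back to a scan (of the covering index).
print(plan("SELECT Title FROM Books WHERE Title LIKE '%gines'"))
```

The plan wording varies by SQLite version, but the seek-versus-scan distinction is the point.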
While the indexing approach that search uses is different from SQL’s, it still obeys some of the same rules: it puts things in order to find them quickly.
Wildcarding from the Front
When you search front-to-back – with the wildcard at the end of the term – it’s really very similar to a search without wildcarding. Search finds the appropriate entries in the index, does some post-filtering for security, and returns them. Search expects to return multiple results anyway; it simply includes entries from the index that it would otherwise have excluded because of the end of the term.
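A minimal sketch of that index walk (Python, with an invented term list): a front-to-back wildcard is just a contiguous slice of the ordered index, located with two binary searches.

```python
from bisect import bisect_left, bisect_right

# A hypothetical ordered search index of terms.
index = sorted(["search", "searching", "seas", "season", "title", "tree"])

def prefix_match(terms, prefix):
    """Return the contiguous slice of the ordered index starting with prefix."""
    lo = bisect_left(terms, prefix)
    # Appending a maximal character caps the range: every term that starts
    # with the prefix sorts at or below prefix + '\uffff'.
    hi = bisect_right(terms, prefix + "\uffff")
    return terms[lo:hi]

print(prefix_match(index, "sea"))  # ['search', 'searching', 'seas', 'season']
```

The engine’s real structures are more elaborate, but the contiguous-range property is what makes prefix wildcards cheap.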
Search is fast because of the indexing process that is done. This indexing process, while substantially more intensive than creating a SQL index due to the volume of data involved, follows the same general data management principles. Indexes start at the front.
One of the improvements of search over SQL, from a data management perspective, is that search allows for a single property or field to have multiple values. This is appropriate because of fields like keyword fields, but also when multiple data fields are mapped to the same search property. For instance, the title of the document as well as a field in the data management system may be mapped to the same title property.
In SQL, if you have a single field with multiple values, it gets indexed with the first value – which is why searching multi-valued fields in SQL is difficult, and why third normal form pushes individual values into independent rows. Search is really doing on its own the work that database designers do manually in SQL. That’s a good thing: it gives us the opportunity to work around back-to-front wildcarding – for a subset of the properties. Let’s take a look at a license plate example to explain what we can do.
Partial Matching with License Plates
The classic data problem with wildcarding on both ends is the license plate match. The story is that a witness saw the license plate of a getaway vehicle, but unfortunately only managed to get three of the six characters on the license plate – and, more challenging, they don’t know which three characters they got. For simplicity, let’s say they observed the letters ABC. Those characters would match any of the following license plates (where ? stands for any single character):

ABC???
?ABC??
??ABC?
???ABC
When the SQL database is set up with fields for each character, you can transform the query in a way that does each of these searches. The result is that SQL can use a set of indexes to solve the query very efficiently (presuming the indexes are correct).
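A sketch of that query transformation (Python; the helper name and the six-character plate length are my assumptions): the three known characters expand into one fixed-position pattern per offset, each of which a positional index can satisfy.

```python
def position_patterns(fragment, plate_len=6):
    """Expand a partial plate into fixed-position SQL LIKE patterns,
    one per possible offset; '_' is the single-character wildcard."""
    slots = plate_len - len(fragment)
    return ["_" * i + fragment + "_" * (slots - i) for i in range(slots + 1)]

print(position_patterns("ABC"))
# ['ABC___', '_ABC__', '__ABC_', '___ABC']
```

ORing these four fixed-position conditions together is what lets SQL stay on its indexes instead of scanning.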
We can’t do this in search, so we flip this approach over and instead of transforming the query, we transform the data – using the idea of multiple values.
Partial Matching and Back-to-Front Wildcarding with Search
If you look back at the table above, you may notice something: if you were to progressively remove characters from the beginning of each license plate, you could check for a match. For instance, let’s say that the bad plate is actually ZZABCD. We would store in a property the following values:

ZZABCD
ZABCD
ABCD
BCD
CD
D
In this case, if you were to search for ABC with a wildcard at the beginning and the end, you would find a match. More specifically, the third value (from the third row in the table) would match. So if you can transform the incoming values such that you store a set of values for the property with progressively more leading characters stripped, the resulting property will be searchable with wildcards on both sides.
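A minimal sketch of the transformation (Python; the function name is mine): store every progressively stripped value, and a both-ends wildcard reduces to a plain front-to-back match against one of them.

```python
def stripped_values(value):
    """All values to store for the property: the original, then the value
    with progressively more leading characters removed."""
    return [value[i:] for i in range(len(value))]

values = stripped_values("ZZABCD")
print(values)  # ['ZZABCD', 'ZABCD', 'ABCD', 'BCD', 'CD', 'D']

# Searching *ABC* now reduces to a prefix match on a stored value;
# here the third value, 'ABCD', matches ABC*.
assert any(v.startswith("ABC") for v in values)
```

This runs at indexing time, so the query side needs no changes beyond allowing the trailing wildcard.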
In short, by transforming a property, we can get the desired effect for a given property – with a few side effects.
Impacts of Partial and Back-to-Front Wildcarding
The first and perhaps most obvious impact of this property transformation is that it increases the amount of storage in the index. As long as the property itself is relatively small, this isn’t generally a big deal. However, it does mean that you wouldn’t necessarily want to do this on every property – and, more to the point, you shouldn’t.
The second impact is that this is a strategy that works for specific properties but doesn’t work for the full text of search. This is generally OK, because the cases where you need it are limited – but it’s not a completely generalizable workaround.
Finally, there will be some impacts to ranking and relevance by doing this which are search engine-specific. It’s possible that, after implementing this strategy, you’ll have shifted the relevance of those searches which query this property.
For these reasons, it’s still a good idea to examine exactly why you believe back-to-front wildcarding is appropriate for your situation – which brings us to the psychology of searching.
Except for relatively rare circumstances, our brains don’t work by picking out the middle of a string. We might be able to recall a segment of a license plate because it’s novel, but in most cases we simply don’t process information this way. We typically think of the start of a word or term, but don’t know how to put the rest of the word together. One exception to this is when we tokenize strings instead of processing them as words.
In many cases, the reason we want to support back-to-front wildcarding is that the user is in the same position as the witness in the license plate example – the license plate isn’t (generally) a word. It gets processed as groups of characters. One or more of the groupings may be accidentally or intentionally memorable, and the user doesn’t remember the rest of the string. For instance, in a part number like A#1264#CIRBRK#US, the CIRBRK portion of the string might be memorable and something someone would want to search on. In this case, the user isn’t really searching from an arbitrary starting point in the string; they’re starting from a breakpoint.
Breakpoints are where the string should naturally be broken. Search engines do this all the time with language to break the content into distinct words that can be searched for. This is controlled by word breakers.
Much of the problem that we’re facing, which is allowing users to search at the start of a word or a part of a longer string, has already been handled in the engine. Every search engine knows how to break strings into distinct words for indexing. What characters are used for breaking words can be language-dependent or set globally.
Some of the standard word breakers are obvious: carriage return, space, and tab, for example. However, depending upon the engine you’re using, hyphens, underscores, and other special characters may or may not be considered word breakers. If they’re not, you get one long string for the value; if they are, it’s broken up into pieces.
Consider the part number string again: if # is not a breaker, the part number is indexed as one complete string. If # is a word breaker, the following terms get indexed: A, 1264, CIRBRK, and US. In that case, if I know I’m looking for CIRBRK, it would match (as would CIRBRK with a wildcard at the end).
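A sketch of that word breaking (Python; the exact breaker set is engine-specific, so the characters chosen here are an assumption):

```python
import re

# Hypothetical breaker set: whitespace plus '#', '-', and '_'.
BREAKERS = re.compile(r"[\s#\-_]+")

def tokenize(value):
    """Split a stored value into the distinct terms the index would hold."""
    return [term for term in BREAKERS.split(value) if term]

print(tokenize("A#1264#CIRBRK#US"))  # ['A', '1264', 'CIRBRK', 'US']
```

With # in the breaker set, a plain search for CIRBRK (or CIRBRK with a trailing wildcard) matches a token directly – no both-ends wildcard needed.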
This is important because some implementations of back-to-front wildcarding aren’t necessary if the appropriate word breakers are in place. If the part number is A1264CIRBRKUS, then you definitely need the back-to-front wildcarding approach described above. However, with separators, it’s more efficient not to transform the property. Like any rule, there are exceptions.
Right to Left Exceptions
You may have noticed that I’ve been assuming left-to-right reading, as in most of the languages in use on the planet today. Some languages are processed right-to-left instead. In these cases, it’s easiest to think of the right-to-left language as having its characters flipped (inverted), so the last character comes first and the first comes last. If you do this, the characteristics are all the same: people in right-to-left languages tend to remember the right side (the start) of a word, not the left side (the end). The psychology matches even if the symbols are reversed.