Optimization of MySQL search using “like” and wildcards

How long are your strings?

If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:

  • For each word that you want to store in the table, instead take every possible suffix of that word. In other words, you keep stripping the first character until nothing is left. For example, the word value gives:
    • value
    • alue
    • lue
    • ue
    • e
  • Store each of these suffixes in the database.
  • You can now search for substrings using LIKE 'alu%' (which will find ‘alu’ as part of ‘value’).

By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.

Storage Cost

The number of characters required to store a word becomes word_len*word_len / 2, i.e. quadratic in the word length, on a per-word basis. Here is the factor of increase for various word sizes:

  • 3-letter word: (3*3/2) / 3 = 1.5
  • 5-letter word: (5*5/2) / 5 = 2.5
  • 7-letter word: (7*7/2) / 7 = 3.5
  • 12-letter word: (12*12/2) / 12 = 6

The number of rows required to store a word increases from 1 to word_len. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.

Considerations

There is a trade-off in where we split ‘words’ (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter as one word or two?

The trade-off is as follows:

  • Anything that is broken up cannot be found as a single element. If we store five and letter separately, searching for five-letter or fiveletter will fail.
  • Anything that is not broken up will take more storage space. Remember, the storage
    requirement increases quadratically in the word length.

For convenience, you might want to remove the hyphen and store fiveletter. The word can now be found by searching five, letter, and fiveletter. (If you strip hyphens from any search query as well, users can still successfully find five-letter.)

Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.

Leave a Comment