Is there any way to detect strings like putjbtghguhjjjanika?

You could build a model of character-to-character transitions from a large body of English text. For example, you count how often an ‘h’ follows a ‘t’ (pretty often). In English, you expect a ‘u’ after a ‘q’; a ‘q’ followed by anything else has very low probability, so seeing one should be pretty alarming. Normalize the counts in each row of your table so that they become probabilities. Then for a query, walk through the matrix and compute the product of the transition probabilities you take, and normalize by the length of the query (for instance, by taking the geometric mean, so that long queries aren’t penalized just for being long). When the number is low, you likely have a gibberish query (or something in a different language).
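As a rough illustration of the idea, here is a minimal sketch in Python. It is not necessarily how the implementation linked further down works. The names `train` and `score`, the 27-character alphabet, and the add-one smoothing are all assumptions made for this example. Length normalization is done by averaging log probabilities, which gives the geometric mean of the transition probabilities.

```python
import math

# Assumed alphabet for this sketch: lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def normalize(text):
    """Lowercase the text and drop anything outside ALPHABET."""
    return [c for c in text.lower() if c in ALPHABET]


def train(lines, smoothing=1.0):
    """Count character transitions, then turn each row into log probabilities.

    `smoothing` is a hypothetical add-one style prior so that unseen
    transitions get a small but nonzero probability.
    """
    idx = {c: i for i, c in enumerate(ALPHABET)}
    n = len(ALPHABET)
    counts = [[smoothing] * n for _ in range(n)]
    for line in lines:
        chars = normalize(line)
        for a, b in zip(chars, chars[1:]):
            counts[idx[a]][idx[b]] += 1
    log_probs = []
    for row in counts:
        total = sum(row)
        log_probs.append([math.log(c / total) for c in row])
    return idx, log_probs


def score(text, model):
    """Geometric mean of the transition probabilities along `text`."""
    idx, log_probs = model
    chars = normalize(text)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0  # nothing to score
    total = sum(log_probs[idx[a]][idx[b]] for a, b in pairs)
    return math.exp(total / len(pairs))
```

To use it, train on as much English text as you can get, score a handful of known-good and known-bad strings, and pick a threshold somewhere between the two groups. Anything scoring below the threshold gets flagged as gibberish.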

If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries when training that model.
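One possible way to do that weighting, sketched in Python: count transitions in the general text with weight 1, and add the transitions from your query logs with a larger weight before normalizing. The function names and the default `query_weight` of 10 are arbitrary assumptions for illustration. You would want to tune the weight on your own data.

```python
from collections import Counter


def bigram_counts(lines, weight=1.0):
    """Weighted counts of adjacent character pairs in `lines`."""
    counts = Counter()
    for line in lines:
        text = line.lower()
        for a, b in zip(text, text[1:]):
            counts[(a, b)] += weight
    return counts


def combined_counts(general_text, query_log, query_weight=10.0):
    """Merge general-English counts with up-weighted query-log counts."""
    counts = bigram_counts(general_text)
    # Counter.update adds to existing counts rather than replacing them.
    counts.update(bigram_counts(query_log, weight=query_weight))
    return counts
```

The merged counts can then be normalized into probabilities in the same way as counts taken from a single corpus.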

For background, read about Markov chains.

Edit: I implemented this in Python here:

https://github.com/rrenaud/Gibberish-Detector

and buggedcom rewrote it in PHP:

https://github.com/buggedcom/Gibberish-Detector-PHP

Sample output (True means the input was judged to be real text, False means gibberish):

my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
