Simple AlphaNumeric Regex (single spacing) without Catastrophic Backtracking

Explanation

Nested group doesn’t automatically causes catastrophic backtracking. In your case, it is because your regex degenerates to the classical example of catastrophic backtracking (a*)*.

Since \s in optional in ^([a-zA-Z0-9'-]+\s?)*$, on input without any spaces but has characters outside the allowed list, the regex simply degenerates to ^([a-zA-Z0-9'-]+)*$.

You can also think in term of expansion of the original regex:

[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?[a-zA-Z0-9'-]+\s?...

Since \s is optional, we can remove it:

[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+[a-zA-Z0-9'-]+...

And we got a series of consecutive [a-zA-Z0-9'-]+, which will try all ways to distribute the characters between themselves and blow up the complexity.

Solution

The standard way to write a regex to match token delimiter token ... delimiter token is token (delimiter token)*. While it is possible to rewrite the regex avoid repeating token, I’d recommend against it, since it is harder to get it right. To avoid repetition , you might want to construct the regex by string concatenation instead.

Following the recipe above:

^[a-zA-Z0-9'-]+(\s[a-zA-Z0-9'-]+)*$

Although you can see repetition in repetition here, there is no catastrophic backtracking, since the regex can only expand to:

[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+\s[a-zA-Z0-9'-]+...

And \s and [a-zA-Z0-9'-] are mutual exclusive – there is only one way to match any string.

Leave a Comment