Fixing Catastrophic Backtracking in Regular Expression

Your current regex can be written as ^(?:[a-zA-Z]:\\|\\\\)([^\\\/\:*?<>"|]+\\?)+$: pay attention at the ? quantifier (it is equal to {0,1} limiting quantifier) after \\ inside a + quantified group.

Once such a pattern like (a+b?)+ is present inside a pattern, there is a high chance of a catastrophical backtracking. Everything is nice when there is a match, say, c:\12\34\aaaaaaaaaaaaaaaaaaa is matched fine, but once a char that is not allowed appears causing a no-match, (try adding * at the end, c:\12\34\aaaaaaaaaaaaaaaaaaa*), the issue will appear.

To solve this, the quantified subpatterns that can match the same text cannot follow one another in immediate succession. And using optional groups where each subpattern is obligatory enables this.

In most scenarios, you need to replace these pattern parts with unrolled a+(ba+)* (1 or more occurrences of a followed with 0 or more sequences of b (that is no longer optional by itself) and then 1 or more occurrences of a (so, between one a and the next a there must be a b). If you need to match an optional \ at the end (as ^(a+b?)+$ actually may match b at the end of the string), you need to add a b? at the end: a+(ba+)*b?.

So, translating this to your current scenario:

^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*$

or if the \ is allowed at the end:

^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*\\?$
                     |      a+       (   b       a+       )* b?

See how it fails gracefully upon a no match, or matches as expected.

As @anubhava suggests, you can further enhance the performance by using possessive quantifiers (or atomic groups instead, since, e.g. .NET regex engine does not support possessives) that disallow any backtracking into the grouped patterns. Once matched, these patterns are not re-tried, thus, failure may come much quicker:

^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*+\\?$
                                                            ^

or an atomic group example:

^(?:[a-zA-Z]:\\|\\\\)(?>[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*)\\?$
                     ^^^                                       ^                          

Note that : is not a special regex metacharacter and should not be escaped. Inside a character class, only -, ^, \ and ] usually require escaping, all others are not special there either.

See more about catastrophical backtracking at The Explosive Quantifier Trap.

Leave a Comment