Can I improve performance of this regular expression further

Brief explanation and a solution

The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there’s more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.

The negated character class solution "([^"]*)" is the best from the 3 because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression ultimately efficient, you can use possessive quantifiers.

If you need to match strings like " + no quotes here + ", use

"([^"]*+)"

or even you do not need to match the trailing quote in this situation:

"([^"]*+)

See regex demo

In fact I am not able to guess how can we describe this regex verbally.

The latter "([^"]*+) regex can be described as

  • " – find the first " symbol from the left of the string
  • ([^"]*+) – match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.

Quantifiers

More information on quantifiers from Rexegg.com:

A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)

As you see, ? is not a separate quantifier, it is a part of another quantifier.

I advise to read more about why Lazy Quantifiers are Expensive and that Negated Class Solution is really safe and fast to deal with your input string (where you just match a quote followed by non-quotes and then a final quote).

Difference between .*?, .* and [^"]*+ quantifiers

  • Greedy "(.*)" solution works like this: checks each symbol from left to right looking for ", and once found grabs the whole string up to the end and checks each symbol if it is equal to ". Thus, in your input string, it backtracks 160 times.

enter image description here

Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.

enter image description here

  • possessive quantifier solution with a negated character class "([^"]*+)" works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the first ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the dot-star will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].

Note that the possessive quantifier does not let the regex engine backtrack into the subexpression, once matched, the symbols between " become one hard block that cannot be “re-sorted” due to some “inconveniences” met by the regex engine, and it will be unable to shift any characters from and into this block of text.

For the current expression, it does not make a big difference though.

enter image description here

Leave a Comment