Why does a simple .*? non-greedy regex greedily include additional characters before a match?

I figured out a solution with some help from the question "Regex lazy vs greedy confusion".

In backtracking regex engines like the one JavaScript uses (NFA engines, I believe), non-greedy only gives you the match that is shortest going left to right: from the first left-hand match that fits to the nearest right-hand match.

Where there are several left-hand matches before one right-hand match, it always starts from the first one it reaches, which actually gives the longest of the possible matches.

Essentially, it goes through the string one character at a time asking “Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat”. I expected it to be “Are there matches anywhere in this string? If so, match the shortest of all of them”.
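
Here is a small sketch of that behaviour. The sample string is made up for illustration: it has two HOHO markers before a single _HO_, so you can see which one the lazy pattern starts from.

```javascript
// Hypothetical sample: two left-hand "HOHO" markers before one "_HO_".
const input = "aaHOHObbHOHOcc_HO_dd";

// The lazy quantifier only shortens the right-hand end of the match.
// The match still begins at the FIRST "HOHO" the engine reaches.
const lazy = input.match(/HOHO.*?_HO_/);

console.log(lazy[0]);    // "HOHObbHOHOcc_HO_"
console.log(lazy.index); // 2
```

The second HOHO (at index 8) would give a shorter match, but the engine never gets that far, because it already found a match starting at index 2.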


You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning “not the left-side match”. Negating a string like this requires a negative lookahead and a non-capturing group, but it’s as simple as dropping the string into (?:(?!).), right after the (?! part. For example, (?:(?!HOHO).) matches any single character that is not the start of HOHO.

So the equivalent of HOHO.*?_HO_ that is non-greedy on both the left and the right would be:

HOHO(?:(?!HOHO).)*?_HO_

So at each character, the regex engine is essentially asking:

  • HOHO – Does this match the left side?
  • (?:(?!HOHO).)* – If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ – If so, grab everything until the right-hand match
  • ? modifier on * or + – If there are multiple right-hand matches, choose the nearest one
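
The steps above can be tried out with a short sketch. The helper names escapeRegExp and shortestBetween are my own inventions for illustration, not built-ins, and the sample string is made up. The helper assumes the left and right sides are literal strings rather than sub-patterns.

```javascript
// Escape regex metacharacters so a literal string can be embedded in a pattern.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds "left, then lazily any characters that do not
// start another left, then right".
function shortestBetween(left, right) {
  const l = escapeRegExp(left);
  const r = escapeRegExp(right);
  return new RegExp(`${l}(?:(?!${l}).)*?${r}`);
}

// Hypothetical sample: two "HOHO" markers, then two "_HO_" markers.
const sample = "aaHOHObbHOHOcc_HO_dd_HO_";

const tempered = shortestBetween("HOHO", "_HO_");
const result = sample.match(tempered);

console.log(tempered.source); // "HOHO(?:(?!HOHO).)*?_HO_"
console.log(result[0]);       // "HOHOcc_HO_"
console.log(result.index);    // 8
```

Starting from the first HOHO fails here, because the engine cannot get to _HO_ without crossing the second HOHO, so it moves on and the match begins at the second one instead. Note that . does not match line breaks by default; if the text between the two sides can span lines, you would probably want the s flag as well.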
