Why are there so many different regular expression dialects?

Because formal regular expressions have only three operations:

  • Concatenation
  • Union |
  • Kleene closure *

Everything else is an extension or syntactic sugar, so there is no underlying definition to standardize against. Capturing groups, backreferences, character classes, cardinality operators (quantifiers), and so on are all additions to the original definition of regular expressions.
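
To make the "only three operations" point concrete, here is a rough sketch of a matcher that supports nothing but single-character literals plus concatenation, union, and Kleene closure. The class names (Lit, Cat, Alt, Star) and the helpers ends and fullmatch are made up for this illustration; no real regex engine is being described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    ch: str          # a single literal character (the base case)


@dataclass(frozen=True)
class Cat:
    left: object     # concatenation: left followed by right
    right: object


@dataclass(frozen=True)
class Alt:
    left: object     # union: left | right
    right: object


@dataclass(frozen=True)
class Star:
    inner: object    # Kleene closure: zero or more repetitions of inner


def ends(node, text, start):
    """Return every position where a match of node starting at start can end."""
    if isinstance(node, Lit):
        return {start + 1} if text[start:start + 1] == node.ch else set()
    if isinstance(node, Cat):
        return {e for m in ends(node.left, text, start)
                for e in ends(node.right, text, m)}
    if isinstance(node, Alt):
        return ends(node.left, text, start) | ends(node.right, text, start)
    if isinstance(node, Star):
        seen = {start}           # zero repetitions always match
        frontier = {start}
        while frontier:          # keep applying inner until nothing new appears
            frontier = {e for m in frontier
                        for e in ends(node.inner, text, m)} - seen
            seen |= frontier
        return seen
    raise TypeError(f"unknown node: {node!r}")


def fullmatch(node, text):
    """True if node matches the whole of text."""
    return len(text) in ends(node, text, 0)


# (a|b)*c written with only the three operations
pattern = Cat(Star(Alt(Lit("a"), Lit("b"))), Lit("c"))
print(fullmatch(pattern, "abac"))   # True
print(fullmatch(pattern, "ab"))     # False
```

Anything beyond these four node types would have to be either rewritten in terms of them or added as a new feature, which is where the dialects start to diverge.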

Some of these extensions make “regular expressions” no longer regular at all. Backreferences, for example, let a pattern recognize languages that are not regular, but we still call them regular expressions regardless.
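
As a small illustration, assuming Python's re module as the dialect: the language "n a's, then b, then the same number of a's" is not regular, yet a single backreference is enough to recognize it. The name same_count_around_b is invented for this example.

```python
import re

# \1 must repeat exactly what group 1 captured, so both runs of a's
# have to be the same length. No finite automaton can check that.
same_count_around_b = re.compile(r"(a*)b\1")

for s in ["b", "aba", "aabaa", "aaba", "abaa"]:
    print(repr(s), bool(same_count_around_b.fullmatch(s)))
```

The first three strings match and the last two do not, because their runs of a's differ in length.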

As people add extensions, they often borrow syntax from other common dialects. That’s why nearly every dialect uses X+ to mean “one or more Xs”, which is just shorthand for XX*.
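
The X+ shortcut can be spot-checked by brute force. This sketch, again assuming Python's re module, compares two patterns on every string over a small alphabet up to a given length; the helper name agree and its parameters are made up for the example, and a bounded check like this is evidence rather than a proof.

```python
import re
from itertools import product


def agree(p1, p2, alphabet="ab", max_len=5):
    """True if p1 and p2 accept exactly the same strings up to max_len."""
    r1, r2 = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return False
    return True


print(agree("a+", "aa*"))   # True: X+ behaves like XX*
print(agree("a+", "a*"))    # False: they differ on the empty string
```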

But when a genuinely new feature gets added, there’s no basis for standardization, so someone has to make something up. If more than one group of designers comes up with a similar idea at around the same time, they end up with different dialects.
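
Named capturing groups are one example of this kind of split: Python's re module spells them (?P<name>...), while .NET, among others, spells them (?<name>...). The sketch below just asks Python's re module which spelling it accepts. The helper compiles is invented for the example, and the result for the second spelling is printed rather than asserted, since it could differ between Python versions.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


python_style = r"(?P<year>\d{4})"   # Python's spelling of a named group
dotnet_style = r"(?<year>\d{4})"    # spelling used by .NET, among others

for pattern in (python_style, dotnet_style):
    print(pattern, "accepted" if compiles(pattern) else "rejected")

# The Python spelling gives access to the group by name.
print(re.fullmatch(python_style, "2024").group("year"))
```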
