What kind of formal languages can modern regex engines parse?

I recently wrote a rather long article on this topic: The true power of regular expressions.

To summarize:

  • Regular expressions with support for recursive subpattern references can match all context-free languages (e.g a^n b^n).
  • Regular expressions with lookaround assertions and subpattern references can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
  • If the assertions have unlimited width (as you say), then all context-sensitive grammars can be matched. I don’t know any regex flavor though that does not have fixed-width restrictions on lookbehind (and at the same time supports subpattern references).
  • Regular expressions with backreferences are NP-complete, so any other NP problem can be solved using regular expressions (after applying a polynomial-time transformation).

Some examples:

  • Matching the context-free language {a^n b^n, n>0}:

    /^(a(?1)?b)$/
    # or
    /^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x
    
  • Matching the context-sensitive language {a^n b^n c^n, n>0}:

    /^
        (?=(a(?-1)?b)c)
        a+(b(?-1)?c)
    $/x
    # or
    /^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x
    

Leave a Comment