Can extended regex implementations parse HTML?

The answer to your question is that yes, so-called “extended regexes” — which are perhaps more properly called patterns than regular expressions in the formal sense — such as those found in Perl and PCRE are indeed capable of recursive descent parsing of context-free grammars.
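
To give a concrete taste of what that means, here is a minimal sketch of my own, not drawn from the posting discussed below, that uses Perl's (?(DEFINE)) blocks and named subpattern calls to match a tiny recursive language of nested, comma-separated lists:

# a toy recursive grammar, assuming nothing beyond core Perl 5.10+:
# a LIST is a parenthesized, comma-separated sequence of ITEMs, and
# an ITEM is either a plain number or another LIST
my $list = qr{
    (?&LIST)
    (?(DEFINE)
        (?<LIST>  \( (?&ITEM) (?: , (?&ITEM) )*  \) )
        (?<ITEM>  (?&LIST) | [0-9]+ )
    )
}x;

print "matched\n" if "(1,(2,3),4)" =~ /\A$list\z/;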

This posting’s pair of approaches illustrates not so much the theoretical as the practical limits of applying regexes to X/HTML. The first approach given there, the one labelled naïve, is more like the sort you are apt to find in most programs that make such an attempt. It can be made to work on well-defined, non-generic X/HTML, often with very little effort. That is its best application, just as open-ended X/HTML is its worst. For illustration, and purely as a hedged sketch of mine rather than anything taken from that posting, a naïve pattern aimed at markup whose shape you fully control might be no more than the example below.
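
# assumes $html holds markup you generated yourself, where every link
# really is written as <a href="..."> with double quotes and no
# surprises; that narrow situation is the only one this is safe for
my @links = $html =~ m{ <a \s+ href="([^"]*)" }gix;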

The second approach, labelled wizardly, uses an actual grammar for parsing. As such, it is fully as powerful as any other grammatical approach. However, it is also far beyond the powers of the overwhelming majority of casual programmers. It also risks re-creating a perfectly fine wheel for negative benefit. I wrote it to show what can be done, but it is something that should almost never actually be done. I wanted to show people why they want to use a parser on open-ended X/HTML by showing them how devilishly hard it is to come even close to getting right, even with some of the most powerful pattern-matching facilities currently available.

Many have misread my posting as somehow advocating the opposite of what I am actually saying. Please make no mistake: I’m saying that it is far too complicated to use. It is a proof by counter-example. I had hoped that by showing how to do it with regexes, people would realize why they did not want to go down that road. While all things are possible, not all are expedient.

My personal rule of thumb is that if the required regex is of only the first category, I may well use it, but that if it requires the fully grammatical treatment of the second category, I use someone else’s already-written parser. So even though I can write a parser, I see no reason to do so, and plenty not to.
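
To be concrete about what someone else’s already-written parser can look like in Perl, here is one possibility among several on CPAN, offered only as a sketch; any competent HTML parser module would serve equally well:

use HTML::TreeBuilder;

# assumes $html holds the document; HTML::TreeBuilder is just one of
# several CPAN parsers that would do equally well here
my $tree = HTML::TreeBuilder->new_from_content($html);
for my $anchor ($tree->look_down(_tag => "a")) {
    my $href = $anchor->attr("href");
    print "$href\n" if defined $href;
}
$tree->delete;    # break the tree's circular references when done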

When carefully crafted for that explicit purpose, patterns can be more resilient to malformed X/HTML than off-the-shelf parsers tend to be, particularly if you have no real opportunity to hack on said parsers to make them more forgiving of the common failure cases that web browsers tend to tolerate but validators do not. However, the grammatical patterns I provide above were designed only for well-formed but reasonably generic HTML (albeit without entity replacement, which is easily enough added). Error recovery in parsers is a separate issue altogether, and by no means a pleasant one.
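
As a hedged aside on the entity question, one way to bolt that on is to hand whatever text you extract to the standard HTML::Entities module:

use HTML::Entities qw(decode_entities);

# assumes $text holds character data already pulled out of the markup;
# decode_entities turns "fish &amp; chips" into "fish & chips"
my $decoded = decode_entities($text);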

Patterns, especially the far more commonplace non-grammatical ones most people are used to seeing and using, are much better suited for grabbing up discrete chunks one at a time than they are for producing a full syntactic analysis. In other words, regexes usually work better for lexing than they do for parsing. Without grammatical regexes, you should not try parsing grammars.
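
To make the lexing-versus-parsing distinction concrete, here is a deliberately crude sketch of a regex used purely as a tokenizer: it chops markup into tag and text chunks one at a time and makes no claim whatsoever about which tags belong inside which:

# a deliberately crude lexer: it knows nothing of comments, CDATA
# sections, script bodies, or quoted ">" characters inside attributes;
# assumes $html holds the markup to be chopped up
my @tokens;
while ($html =~ m{ \G (?: ( <[^>]*> ) | ( [^<]+ ) ) }gcx) {
    push @tokens, defined $1 ? [ tag => $1 ] : [ text => $2 ];
}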

But don’t take that too far. I certainly do not mean to imply that you should immediately turn to a full-blown parser just because you want to tackle something that is recursively defined. The easiest and perhaps most commonly seen example of this sort of thing is a pattern to detect nested items, like parentheses. It’s extremely common for me to just plop down something simple like this in my code, and be done with it:

# delete all nested parens
s/\((?:[^()]*+|(?0))*\)//g;
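
With $_ containing "keep (this (nested) bit gone) keep", that substitution leaves "keep  keep" behind; the (?0) is the piece that recurses into the inner pair of parentheses, while the possessive [^()]*+ keeps the match from backtracking pathologically across long runs of ordinary text.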
