Why is it such a bad idea to parse XML with regex? [closed]

The real trouble is nested tags, which are very difficult to handle with regular expressions. It’s possible with balancing groups, which are specific to .NET, or with the recursive patterns that a few other flavors such as PCRE and Perl support. But even with that extra power, an ill-placed comment can throw off the regular expression.

For example, this is a tricky one to parse…

<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>

You could be chasing edge cases like this for hours with a regular expression, and maybe find a solution. But really, there’s no point when there are specialized XML, XHTML, and HTML parsers out there that do the job more reliably and efficiently.
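To make the comparison concrete, here is a minimal Python sketch run against the snippet above. The regular expression is just one naive attempt, not the only pattern someone might write, and the standard-library `xml.etree.ElementTree` module stands in for "a specialized parser". It only works here because this particular fragment happens to be well-formed XML; real-world HTML would usually need an HTML parser instead.

```python
import re
import xml.etree.ElementTree as ET

# The tricky document from the example above.
doc = """<div>
    <div id="parse-this">
        <!-- oops</div> -->
        try to get this value with regex
    </div>
</div>"""

# Naive regex: capture everything up to the first closing </div>.
# The first </div> it meets is the one inside the comment.
match = re.search(r'<div id="parse-this">(.*?)</div>', doc, re.S)
naive = match.group(1).strip()

# A parser knows what a comment is. ElementTree's default parser
# discards comments, so only the real text content remains.
root = ET.fromstring(doc)
inner = root.find(".//div[@id='parse-this']")
parsed = inner.text.strip()

print(repr(naive))   # '<!-- oops'
print(repr(parsed))  # 'try to get this value with regex'
```

The regex could be patched to skip comments, but then CDATA sections, attribute values containing `>`, and nested `div` elements each need another patch, which is exactly the edge-case chase described above.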
