Bash – Regex for HTML contents

Don’t parse XML/HTML with regex; use a proper XML/HTML parser.

theory :

According to compiler theory, HTML can’t be parsed with regular expressions, which are equivalent to finite state machines. Because HTML is hierarchical (elements can nest to arbitrary depth), you need a pushdown automaton, for example an LALR grammar handled by a tool like YACC.
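To make the nesting problem concrete, here is a minimal sketch in Python, using only the standard library. The sample markup and the class name OuterDivText are made up for illustration. A non-greedy regex stops at the first closing tag, while a parser that tracks nesting depth (a very small stand-in for the stack of a pushdown automaton) sees the whole element.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a <div> nested inside another <div>.
HTML = "<div>outer <div>inner</div> tail</div>"

# The regex has no notion of depth: it pairs the first <div> with the
# first </div>, so the outer element is cut short.
regex_match = re.search(r"<div>(.*?)</div>", HTML).group(1)


class OuterDivText(HTMLParser):
    """Collect all text found inside <div> elements, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div> elements are currently open
        self.chunks = []    # text pieces seen while depth > 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = OuterDivText()
parser.feed(HTML)
parser.close()
parser_text = "".join(parser.chunks)

print(repr(regex_match))  # 'outer <div>inner'
print(repr(parser_text))  # 'outer inner tail'
```

The depth counter is the part a plain finite state machine can’t provide for arbitrary nesting, which is why the regex result loses " tail".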

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint: often installed by default with libxml2, XPath 1

xmlstarlet: can edit, select, transform… Not installed by default, XPath 1

xpath: installed via Perl’s module XML::XPath, XPath 1

xidel: XPath 3

saxon-lint: my own project, a wrapper over @Michael Kay’s Saxon-HE Java library, XPath 3

Or you can use a high-level language and a proper library. I’m thinking of :

Python’s lxml (from lxml import etree)

Perl’s XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

PHP’s DOMXPath


Check: Using regular expressions with HTML tags


Example using xidel :

xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
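For comparison, here is a rough Python sketch of the same kind of extraction. It is an assumption-laden illustration, not a port of xidel: it uses the standard library’s html.parser instead of lxml so that it runs without third-party packages, the names HrefIds, extract_ids and SAMPLE are hypothetical, and links whose href has no id are simply skipped. The parser finds the href attributes; the regex is applied only to the attribute value, which is a plain string.

```python
import re
from html.parser import HTMLParser


class HrefIds(HTMLParser):
    """Collect the digits after 'id=' from the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; entities are already decoded.
        href = dict(attrs).get("href") or ""
        match = re.search(r"id=(\d+)", href)
        if match:
            self.ids.append(match.group(1))


def extract_ids(html):
    """Return the id values found in the links of an HTML string."""
    parser = HrefIds()
    parser.feed(html)
    parser.close()
    return parser.ids


# Hypothetical sample page, standing in for the content of "$currentURL".
SAMPLE = """
<p><a href="/item?id=42">first</a>
<a href="/about">no id here</a>
<a class="x" href='/item?page=2&amp;id=7'>second</a></p>
"""

ids = extract_ids(SAMPLE)
print(ids)  # ['42', '7']
```

Quoting style, attribute order and the escaped ampersand are all handled by the parser, which is exactly the part a hand-written regex over raw HTML tends to get wrong.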
