Bash - Regex for HTML contents

Don’t parse XML/HTML with regex; use a proper XML/HTML parser instead.

theory :

According to compilation theory, HTML can’t be parsed with regular expressions, which are based on finite-state machines. Because HTML is hierarchical (elements nest to arbitrary depth), you need a pushdown automaton and a grammar-based parser, for example an LALR grammar processed with a tool like YACC.
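To make the theory concrete, here is a minimal sketch of two common regex attempts going wrong. The sample strings and variable names are invented for illustration, and it assumes a grep that supports -o (GNU, BSD and busybox grep do).

```shell
# Hypothetical sample inputs: two sibling elements, and one nested element.
siblings='<div>a</div><div>b</div>'
nested='<div>outer <div>inner</div> tail</div>'

# Greedy pattern: both sibling elements are swallowed as a single match.
greedy_match=$(printf '%s\n' "$siblings" | grep -oE '<div>.*</div>')

# "No tag inside" pattern: only the innermost element matches,
# and the outer <div> is never reported.
flat_match=$(printf '%s\n' "$nested" | grep -oE '<div>[^<]*</div>')

printf 'greedy on siblings: %s\n' "$greedy_match"
printf 'flat on nested:     %s\n' "$flat_match"
```

Neither pattern handles both inputs, because matching an opening tag with its own closing tag requires tracking nesting depth, which a finite-state machine cannot do.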

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint: often installed by default with libxml2, XPath 1

xmlstarlet: can edit, select, transform… Not installed by default, XPath 1

xpath: installed via Perl’s module XML::XPath, XPath 1

xidel: XPath 3

saxon-lint: my own project, a wrapper over @Michael Kay’s Saxon-HE Java library, XPath 3

Or you can use a high-level language with a proper library, for example:

Python’s lxml (from lxml import etree)

Perl’s XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

PHP’s DOMXPath

Check: Using regular expressions with HTML tags

Example using xidel:

xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
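For comparison, a rough xmllint equivalent might look like the sketch below. XPath 1 has no regex functions, so it uses substring-after() instead of extract(). The sample HTML and variable names are made up for illustration, and it only reads the first link on the page.

```shell
# Hypothetical sample page with a single link carrying an id parameter.
html='<html><body><a href="/item?id=42">first</a></body></html>'

if command -v xmllint >/dev/null 2>&1; then
  # --html selects the HTML parser; "-" reads the document from stdin.
  # substring-after() takes the string value of the first matching @href.
  id=$(printf '%s\n' "$html" |
    xmllint --html --xpath 'substring-after(//a/@href, "id=")' - 2>/dev/null)
  printf '%s\n' "$id"
else
  echo 'xmllint not found; install libxml2 utilities to try this' >&2
fi
```

To process every link rather than the first one, a tool with XPath 3 support such as xidel is usually more convenient.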
