Don't parse XML/HTML with regex; use a proper XML/HTML parser.
Theory:
According to compiler theory, XML/HTML can't be parsed with regular expressions, which are equivalent to finite-state machines. Because HTML is hierarchical (elements nest inside elements to arbitrary depth), you need a pushdown automaton, i.e. a parser driven by a grammar, for example an LALR grammar processed with a tool like YACC.
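To make the point concrete, here is a minimal sketch in Python using only the standard library. The sample string and the `DepthTracker` class are hypothetical names made up for illustration. It shows a non-greedy regex cutting a nested element short, while a parser that keeps a stack of open tags (the pushdown idea) follows the nesting correctly.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one <div> nested inside another.
HTML = "<div>outer <div>inner</div> tail</div>"

# The non-greedy regex stops at the first closing tag, so the outer
# element is truncated: a plain regex cannot count nesting depth.
regex_match = re.search(r"<div>(.*?)</div>", HTML).group(1)


class DepthTracker(HTMLParser):
    """Track open tags on a stack, as a pushdown automaton would."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags that are currently open
        self.max_depth = 0   # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


tracker = DepthTracker()
tracker.feed(HTML)
tracker.close()

print(repr(regex_match))                 # 'outer <div>inner'
print(tracker.max_depth, tracker.stack)  # 2 []
```

The regex result is missing " tail" and contains an unbalanced tag, whereas the stack ends up empty, which means every opened element was closed.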
realLife©®™ everyday tools in a shell:
You can use one of the following:
xmllint: often installed by default with libxml2; XPath 1.0
xmlstarlet: can edit, select, transform…; not installed by default; XPath 1.0
xpath: installed via Perl's XML::XPath module; XPath 1.0
xidel: XPath 3
saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library; XPath 3
Or you can use a high-level language with a proper library, for example:
Python's lxml (from lxml import etree)
Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Check: Using regular expressions with HTML tags
Example using xidel:
xidel -s "$currentURL" -e '//a/extract(@href,"id=(\d+)",1)'
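For comparison, here is a rough Python sketch of the same idea: let a parser find the `<a>` elements and their `href` attributes, and apply the regex only to the attribute value. To stay self-contained it uses the standard library's `html.parser` rather than lxml, and it works on a string instead of fetching a URL. `HrefCollector`, `extract_ids` and the sample markup are hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_ids(html):
    """Return the digits following 'id=' in each link that has them."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    ids = []
    for href in collector.hrefs:
        # The regex only sees an attribute value, never the markup itself.
        match = re.search(r"id=(\d+)", href)
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical sample page (assumption, not taken from a real site).
SAMPLE = """
<p>id=999 in plain text is ignored</p>
<a href="/item?id=42">first</a>
<a class="x" href='/item?id=7&amp;ref=1'>second</a>
<a href="/about">no id here</a>
"""

print(extract_ids(SAMPLE))  # ['42', '7']
```

The parser copes with attribute order, quoting style and entity escaping; the regex is left with the one job it is suited for, matching a pattern inside a flat string. In this sketch, links without an `id=` value are simply skipped.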