Use a DOM library, not regular expressions, to manipulate HTML:
- lxml: a parser, document, and HTML serializer. It can also use BeautifulSoup and html5lib for parsing.
- BeautifulSoup: a parser, document, and HTML serializer.
- html5lib: a parser that also includes a serializer.
- ElementTree: a document object and XML serializer.
- cElementTree: a document object implemented as a C extension.
- HTMLParser: a parser.
- Genshi: includes a parser, document, and HTML serializer.
- xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.
Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.
Of these, I would recommend lxml, html5lib, and BeautifulSoup.
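To illustrate why a parser beats a regular expression, here is a minimal sketch that pulls link targets out of a snippet. It assumes Python 3, where HTMLParser lives in the standard library as `html.parser`, and it sticks to the standard library only so that it runs without installing anything. The `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, strips the
        # quotes, and decodes entities such as &amp; in attribute values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Hypothetical helper: return all link targets found in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Illustrative input: mixed case, single and double quotes, an escaped
# ampersand, and a link that is commented out.
sample = """
<p>See <A HREF='/docs'>the docs</A> and
<a class="ext" href="http://example.com/?a=1&amp;b=2">an example</a>.</p>
<!-- <a href="/hidden">commented out</a> -->
"""

links = extract_links(sample)
print(links)
```

The commented-out link is skipped because the parser hands comments to a separate handler, and the case, quoting, and entity differences are normalized before the handler sees them. A regular expression would have to account for each of those cases by hand.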