Filter out HTML tags and resolve entities in python

Use lxml which is the best xml/html library for python.

import lxml.html
t = lxml.html.fromstring("...")
t.text_content()

And if you just want to sanitize the html look at the lxml.html.clean module

Leave a Comment