Remove HTML tags not on an allowed list from a Python string

Use lxml.html.clean! It’s VERY easy!

# On lxml 5.2+ the cleaner is a separate package: pip install lxml_html_clean
from lxml.html.clean import clean_html
print(clean_html(html))

Suppose the following html:

html = '''\
<html>
 <head>
   <script type="text/javascript" src="https://stackoverflow.com/questions/699468/evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="https://stackoverflow.com/questions/699468/evil-site"></iframe>
   <form action="https://stackoverflow.com/questions/699468/evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="https://stackoverflow.com/questions/699468/evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

The results…

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="https://stackoverflow.com/questions/699468/evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

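You can customize what gets cleaned by creating your own lxml.html.clean.Cleaner instance with the options you need, for example allow_tags to keep only an allowed list of tags.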
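
For the allowed-list case in the question, here is a minimal sketch using Cleaner. The allowed list and the input snippet are made up for illustration, and the fallback import assumes the separate lxml_html_clean package on newer lxml versions. Note that remove_unknown_tags has to be switched off, since Cleaner refuses to combine it with allow_tags.

```python
# Sketch: keep only an allowed list of tags with lxml's Cleaner.
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: on lxml 5.2+ the cleaner lives in the lxml_html_clean package.
    from lxml_html_clean import Cleaner

ALLOWED_TAGS = ["p", "b", "a"]  # hypothetical allowed list for this example

cleaner = Cleaner(
    allow_tags=ALLOWED_TAGS,
    remove_unknown_tags=False,  # must be off when allow_tags is given
)

# Made-up input: an event handler, a disallowed tag and a script.
snippet = (
    '<p onclick="evil()">Hello <b>bold</b> '
    '<blink>world</blink><script>alert(1)</script></p>'
)

cleaned = cleaner.clean_html(snippet)
print(cleaned)
```

With the default options, the script element and the onclick attribute should be dropped entirely, while tags that are simply not on the list (blink here) are unwrapped and their text is kept.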
