Sanitising user input using Python

Here is a snippet that removes all tags not on the tag whitelist, and all tag attributes not on the attribute whitelist (so you can't use onclick).

It is a modified version of http://www.djangosnippets.org/snippets/205/, with a regex applied to the attribute values to prevent people from using href="javascript:...", and the other cases described at http://ha.ckers.org/xss.html
(e.g. <a href="ja&#x09;vascript:alert('hi')"> or <a href="ja vascript:alert('hi')">, etc.)

As you can see, it uses the (awesome) BeautifulSoup library.

# Written for Python 2 and BeautifulSoup 3 (the `BeautifulSoup` package, not `bs4`)
import re
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup, Comment

def sanitizeHtml(value, base_url=None):
    rjs = r'[\s]*(&#x.{1,7})?'.join(list('javascript:'))
    rvb = r'[\s]*(&#x.{1,7})?'.join(list('vbscript:'))
    re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)
    validTags="p i strong b u a h1 h2 h3 pre br img".split()
    validAttrs="href src width height".split()
    urlAttrs="href src".split() # Attributes which should have a URL
    soup = BeautifulSoup(value)
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        # Get rid of comments
        comment.extract()
    for tag in soup.findAll(True):
        if tag.name not in validTags:
            tag.hidden = True
        attrs = tag.attrs
        tag.attrs = []
        for attr, val in attrs:
            if attr in validAttrs:
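                # Remove script schemes (vbs & js); repeat until none remain, so
                # nested input like "jajavascript:vascript:" cannot survive one pass
                while re_scripts.search(val):
                    val = re_scripts.sub('', val)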
                if attr in urlAttrs:
                    val = urljoin(base_url, val) # Calculate the absolute url
                tag.attrs.append((attr, val))

    return soup.renderContents().decode('utf8')
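
To see what the scheme regex actually catches, it can be pulled out and tried on its own. The following is only a sketch: the pattern is built the same way as in sanitizeHtml, but the names FILLER and strip_script_schemes are made up for illustration and are not part of the snippet. It also repeats the substitution until nothing matches, because a single sub() pass turns "jajavascript:vascript:" into "javascript:".

```python
import re

# Between every character of the scheme name, allow optional whitespace and an
# optional hex character reference such as &#x09; (same idea as in sanitizeHtml).
FILLER = r'[\s]*(&#x.{1,7})?'  # hypothetical name, for illustration only
rjs = FILLER.join(list('javascript:'))
rvb = FILLER.join(list('vbscript:'))
re_scripts = re.compile('(%s)|(%s)' % (rjs, rvb), re.IGNORECASE)


def strip_script_schemes(val):
    """Remove javascript:/vbscript: prefixes, including obfuscated ones.

    Hypothetical helper. The loop repeats the substitution until the value is
    stable, since removing one match can splice a new scheme together.
    """
    while re_scripts.search(val):
        val = re_scripts.sub('', val)
    return val


print(strip_script_schemes("ja&#x09;vascript:alert('hi')"))    # alert('hi')
print(strip_script_schemes("ja vascript:alert('hi')"))         # alert('hi')
print(strip_script_schemes("jajavascript:vascript:alert(1)"))  # alert(1)
print(strip_script_schemes("http://example.com/page"))         # unchanged
```

This only covers the two schemes and the obfuscations the pattern spells out; it is a blacklist, so treat it as one layer rather than a guarantee.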

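As the other posters have said, pretty much all Python db libraries take care of SQL injection as long as you pass user values as query parameters instead of building the SQL string yourself, so together with the snippet above this should cover the common cases.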
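
For the SQL side, here is a minimal sketch of what "letting the library take care of it" looks like, using the standard-library sqlite3 module on Python 3. The users table and the find_user_ids helper are invented for the example; other DB-API drivers work the same way but may use a different placeholder style (for instance %s instead of ?).

```python
import sqlite3


def find_user_ids(conn, name):
    """Return ids of users with exactly this name (hypothetical helper)."""
    # The ? placeholder makes the driver bind `name` as data, never as SQL text.
    cur = conn.execute(
        "SELECT id FROM users WHERE name = ? ORDER BY id", (name,)
    )
    return [row[0] for row in cur.fetchall()]


# Throwaway in-memory database with an assumed schema, just for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

print(find_user_ids(conn, "alice"))        # [1]
print(find_user_ids(conn, "' OR '1'='1"))  # []
```

The second call returns nothing because the injection attempt is compared as a literal name. Had the query been built with string formatting, the same input would have matched every row.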
