BeautifulSoup return unexpected extra spaces

I believe this is a bug with Lxml’s HTML parser.
Try:

from bs4 import BeautifulSoup

import urllib2
html = urllib2.urlopen ("http://www.beppegrillo.it")
prova = html.read()
soup = BeautifulSoup(prova.replace('ISO-8859-1', 'utf-8'))
print soup

Which is a workaround for the problem.
I believe the issue was fixed in lxml 3.0 alpha 2 and lxml 2.3.6, so it could be worth checking whether you need to upgrade to a newer version.

If you want more info on the bug it was initially filed here:

https://bugs.launchpad.net/beautifulsoup/+bug/972466

Hope this helps,

Hayden

Leave a Comment