Beautiful Soup and Table Scraping – lxml vs html parser

Short answer. If you already installed lxml, just use it. html.parser – BeautifulSoup(markup, “html.parser”) Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.) Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2) lxml – BeautifulSoup(markup, “lxml”) Advantages: Very fast, Lenient Disadvantages: External C dependency html5lib – BeautifulSoup(markup, “html5lib”) Advantages: Extremely lenient, Parses … Read more

Why doesn’t xpath work when processing an XHTML document with lxml (in python)?

The problem is the namespaces. When parsed as XML, the img tag is in the http://www.w3.org/1999/xhtml namespace since that is the default namespace for the element. You are asking for the img tag in no namespace. Try this: >>> tree.getroot().xpath( … “//xhtml:img”, … namespaces={‘xhtml’:’http://www.w3.org/1999/xhtml’} … ) [<Element {http://www.w3.org/1999/xhtml}img at 11a29e0>]