Beautiful Soup and Table Scraping - lxml vs html parser

Short answer: if you already have lxml installed, just use it.

html.parser – BeautifulSoup(markup, "html.parser")

Advantages: Batteries included, decent speed, lenient (as of Python 2.7.3 and 3.2.2)
Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2)

lxml – BeautifulSoup(markup, "lxml")

Advantages: Very fast, Lenient
Disadvantages: External C dependency

html5lib – BeautifulSoup(markup, "html5lib")

Advantages: Extremely lenient, Parses pages the same way a web browser does, Creates valid HTML5
Disadvantages: Very slow, External Python dependency
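
As a rough sketch of how the choice above might look for table scraping: the snippet below tries lxml first and falls back to the built-in html.parser if lxml is not installed. The helper names `make_soup` and `table_rows` and the sample markup are made up for illustration; only `BeautifulSoup`, `FeatureNotFound`, `find_all` and `get_text` come from bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical sample markup, standing in for a downloaded page.
HTML = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><td>pear</td><td>5</td></tr>
</table>
"""


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return a soup built with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            continue
    raise RuntimeError("no usable parser found")


def table_rows(soup):
    """Collect the text of every header/data cell, one list per table row."""
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


rows = table_rows(make_soup(HTML))
for row in rows:
    print(row)
```

For well-formed markup like this, both parsers should yield the same rows; the differences listed above mostly show up on broken HTML and on large documents.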
