How to handle IncompleteRead in Python

The link you included in your question is simply a wrapper that executes urllib's read() function, catching any incomplete-read exceptions for you. If you don't want to implement this entire patch, you could always wrap the reads in a try/except block. For example:

```python
try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead,
```

… Read more
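In Python 3 the same idea applies with http.client.IncompleteRead, which carries the bytes received so far in its .partial attribute. A minimal sketch of salvaging that partial body; flaky_fetch here is a hypothetical stand-in for a real network read such as urllib.request.urlopen(url).read():

```python
from http.client import IncompleteRead

def read_with_partial(fetch):
    # Call fetch(); if the server truncates the body, fall back to the
    # partial bytes attached to the exception instead of failing outright.
    try:
        return fetch()
    except IncompleteRead as e:
        return e.partial

def flaky_fetch():
    # Hypothetical stand-in for a real HTTP read that gets cut short.
    raise IncompleteRead(b"partial body")

print(read_with_partial(flaky_fetch))  # b'partial body'
```

Whether the partial body is usable depends on the content; for HTML it is often enough to parse what arrived.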

Extract content within a tag with BeautifulSoup

The .contents attribute works well for extracting text from <tag>text</tag>. For example:

```python
s = "<td>My home address</td>"
soup = BeautifulSoup(s)
td = soup.find('td')  # <td>My home address</td>
td.contents           # ['My home address']
```

With a nested tag:

```python
s = "<td><b>Address:</b></td>"
soup = BeautifulSoup(s)
td = soup.find('td').find('b')  # <b>Address:</b>
td.contents                     # ['Address:']
```
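Note that .contents returns a list of the tag's direct children, not a plain string; when you just want the flattened text of a tag, .get_text() is often more convenient. A small sketch (parser name passed explicitly, which recent BeautifulSoup versions expect):

```python
from bs4 import BeautifulSoup

s = "<td><b>Address:</b></td>"
soup = BeautifulSoup(s, "html.parser")
td = soup.find("td")

# .contents lists direct children: here a single <b> tag
print(td.contents)            # [<b>Address:</b>]
print(td.find("b").contents)  # ['Address:']

# .get_text() flattens nested tags down to plain text
print(td.get_text())          # Address:
```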

Download all pdf files from a website using Python

Check out the following implementation. I've used the requests module instead of urllib to do the download, and the .select() method instead of .find_all() to avoid using re.

```python
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"

# If there is no such folder, the script will create one automatically
```

… Read more
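The core of the approach is the link filtering: a CSS ends-with selector picks out .pdf hrefs, and urljoin resolves relative links against the page URL. A sketch of just that step, testable on static HTML (the example.com URL is illustrative):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def pdf_links(html, base_url):
    # a[href$='.pdf'] is a CSS "ends with" attribute selector, so no
    # regex is needed; urljoin makes relative hrefs absolute.
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.select("a[href$='.pdf']")]

html = '<a href="notes/lect1.pdf">slides</a><a href="index.html">home</a>'
print(pdf_links(html, "http://example.com/teaching/"))
# ['http://example.com/teaching/notes/lect1.pdf']
```

From there, each resulting URL can be fetched with requests and written to disk.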

Beautiful Soup and Table Scraping – lxml vs html parser

Short answer: if you already have lxml installed, just use it.

html.parser: BeautifulSoup(markup, "html.parser")
- Advantages: batteries included, decent speed, lenient (as of Python 2.7.3 and 3.2)
- Disadvantages: not very lenient (before Python 2.7.3 or 3.2.2)

lxml: BeautifulSoup(markup, "lxml")
- Advantages: very fast, lenient
- Disadvantages: external C dependency

html5lib: BeautifulSoup(markup, "html5lib")
- Advantages: extremely lenient, parses … Read more
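The leniency differences show up on broken markup: each parser repairs an unclosed or out-of-context tag in its own way. A small sketch using only the stdlib html.parser backend (lxml and html5lib, if installed, would instead typically wrap the fragment in html/body boilerplate or restructure it):

```python
from bs4 import BeautifulSoup

broken = "<td>cell"  # unclosed table cell with no surrounding <table>

# html.parser keeps the fragment as-is and simply closes the open tag.
print(BeautifulSoup(broken, "html.parser"))
```

Because the resulting trees differ, code that works with one parser can silently find nothing with another, so it is worth pinning the parser name explicitly.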

Missing parts on Beautiful Soup results

BeautifulSoup can use different parsers to handle HTML input. The HTML input here is a little broken, and the default HTMLParser-based parser doesn't handle it very well. Use the html5lib parser instead:

```python
>>> len(BeautifulSoup(r.text, 'html').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'lxml').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'html5lib').find('td', attrs={'class': 'eelantext'}).find_all('p'))
22
```

Can I scrape the raw data from highcharts.js?

The data is in a script tag. You can get the script tag using bs4 and a regex. You could also extract the data with a regex alone, but I like using js2xml to parse JavaScript functions into an XML tree:

```python
from bs4 import BeautifulSoup
import requests
import re
import js2xml

soup = BeautifulSoup(requests.get("http://www.worldweatheronline.com/brussels-weather-averages/be.aspx").content, "html.parser")
```

script … Read more
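If you would rather avoid the js2xml dependency and the chart series happens to be a JSON-like array literal, a regex plus json.loads can pull it straight out of the script text. A sketch on a made-up snippet; the chartData variable name and values are hypothetical, not taken from the page above:

```python
import json
import re

# Hypothetical script content; real Highcharts pages embed similar
# "var name = [...]" assignments inside their <script> tags.
page_script = "var chartData = [1.5, 2.0, 3.25];"

# Capture the bracketed array literal after the assignment.
match = re.search(r"chartData\s*=\s*(\[[^\]]*\])", page_script)
data = json.loads(match.group(1))
print(data)  # [1.5, 2.0, 3.25]
```

This only works when the array is valid JSON (no JavaScript expressions or trailing commas); for anything more complex, parsing with js2xml is the more robust route.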