python BeautifulSoup parsing table

Here you go:

    data = []
    table = soup.find('table', attrs={'class': 'lineItemsTable'})
    table_body = table.find('tbody')

    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values

This gives you:

    [ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', …
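The row-by-row approach in the answer above can be tried without a live page. The following is a minimal sketch, assuming a small hypothetical HTML snippet (only the class name lineItemsTable comes from the answer) and a hypothetical helper named parse_table; it also falls back to the table element itself when there is no tbody, which is an addition and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only the class name comes from the answer.
HTML = """
<table class="lineItemsTable">
  <tbody>
    <tr><td>1001</td><td> SRF </td><td></td><td>08/05/2013</td></tr>
    <tr><td>1002</td><td>ABC</td><td>09/05/2013</td><td>K</td></tr>
  </tbody>
</table>
"""


def parse_table(html, css_class):
    """Return the table's rows as lists of stripped, non-empty cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": css_class})
    if table is None:
        return []  # no matching table in the document

    # Assumption: if the markup has no <tbody>, search the table directly.
    body = table.find("tbody")
    if body is None:
        body = table

    data = []
    for row in body.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        data.append([cell for cell in cells if cell])  # drop empty cells
    return data


rows = parse_table(HTML, "lineItemsTable")
print(rows)
```

Checking for None before calling find on the result avoids the AttributeError that the original snippet would raise on a page without the expected table.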

BeautifulSoup Grab Visible Webpage Text

Try this:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request


    def tag_visible(element):
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        if isinstance(element, Comment):
            return False
        return True


    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.findAll(text=True)
        visible_texts = filter(tag_visible, texts)
        return u" ".join(t.strip() for t in visible_texts)

    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    …
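The same filtering idea can be exercised offline. This is a rough sketch, assuming a hypothetical inline HTML sample in place of the downloaded page and a hypothetical function name visible_text. It differs from the answer above in two small ways that are choices made here, not the author's: it uses find_all(string=True), and it skips whitespace-only strings so the joined result has no runs of spaces.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not rendered on the page (same list as the answer).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def visible_text(html):
    """Join the text nodes a reader would actually see on the page."""
    soup = BeautifulSoup(html, "html.parser")
    pieces = []
    for node in soup.find_all(string=True):
        if isinstance(node, Comment):
            continue  # HTML comments are never displayed
        if node.parent.name in HIDDEN_PARENTS:
            continue  # text inside <script>, <style>, <head>, ...
        text = node.strip()
        if text:  # assumption: whitespace-only nodes are not wanted
            pieces.append(text)
    return " ".join(pieces)


# Hypothetical sample document standing in for a fetched page.
SAMPLE = """<html><head><title>Ignored title</title>
<style>p { color: red; }</style></head>
<body><!-- a comment --><p>Hello</p><script>var x = 1;</script>
<p>visible <b>world</b></p></body></html>"""

result = visible_text(SAMPLE)
print(result)
```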

UnicodeEncodeError: 'charmap' codec can't encode characters

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:

    with open(fname, "w") as f:
        f.write(html)

with this:

    with open(fname, "w", encoding="utf-8") as f:
        f.write(html)

If you need to support Python 2, then use this:

    import io
    with io.open(fname, "w", encoding="utf-8") as f:
        f.write(html)

…
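To see why the explicit encoding matters, the failure can be reproduced without depending on the machine's locale. The sketch below assumes that the platform default was cp1252 (a common "charmap" codec on Windows; the original answer does not say which codec was involved) and uses a hypothetical sample string and a temporary file.

```python
import os
import tempfile

# Hypothetical scraped content containing characters outside cp1252.
html = "<p>caf\u00e9 \u2603 \u4f60\u597d</p>"

# Assumption: the default file encoding was cp1252. Encoding with it
# directly shows the same class of error open(fname, "w") would raise.
try:
    html.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    print("cp1252 failed:", exc.reason)

# Writing with an explicit UTF-8 encoding works for any Unicode text.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == html)
```

Passing the same encoding when reading the file back keeps the round trip independent of the locale as well.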

Extracting an attribute value with beautifulsoup

.find_all() returns a list of all found elements, so:

    input_tag = soup.find_all(attrs={"name": "stainfo"})

input_tag is a list (probably containing only one element). Depending on what exactly you want, you should either do:

    output = input_tag[0]['value']

or use the .find() method, which returns only the first element found:

    input_tag = soup.find(attrs={"name": "stainfo"})
    output = input_tag['value']
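Both variants can be compared side by side on a small document. This is a minimal sketch, assuming a hypothetical form snippet (only the name stainfo comes from the answer). It also shows two defensive habits that are additions here: checking for None when .find() matches nothing, and using .get() when the attribute may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the attribute value "abc123" is made up for the example.
HTML = (
    '<form>'
    '<input type="hidden" name="stainfo" value="abc123">'
    '<input name="other">'
    '</form>'
)
soup = BeautifulSoup(HTML, "html.parser")

# find_all() returns a list, so index into it before reading the attribute.
matches = soup.find_all(attrs={"name": "stainfo"})
value_from_list = matches[0]["value"]

# find() returns the first match directly, or None when nothing matches.
first = soup.find(attrs={"name": "stainfo"})
value_from_find = first["value"]

missing = soup.find(attrs={"name": "does-not-exist"})

# tag["value"] raises KeyError if the attribute is absent; .get() does not.
other = soup.find(attrs={"name": "other"})
fallback = other.get("value", "")

print(value_from_list, value_from_find, missing, repr(fallback))
```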

retrieve links from web page using python and BeautifulSoup [closed]

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

    import httplib2
    from bs4 import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class …
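The SoupStrainer idea can also be tried on a string instead of a live request. The sketch below assumes a hypothetical inline HTML sample and names an explicit parser; instead of iterating over the soup directly as the answer does, it calls find_all("a", href=True), a variation chosen here so that only tags with an href attribute are visited.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page content standing in for the HTTP response body.
HTML = """<html><body>
<a href="/home">Home</a>
<p>text <a href="https://example.com/a">A</a></p>
<a name="anchor-without-href">no link</a>
</body></html>"""

# Parse only the <a> elements; the rest of the document is discarded.
only_links = SoupStrainer("a")
soup = BeautifulSoup(HTML, "html.parser", parse_only=only_links)

# href=True keeps just the anchors that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```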