beautifulsoup - w3toppers.com

python BeautifulSoup parsing table

Here you go:

    data = []
    table = soup.find('table', attrs={'class': 'lineItemsTable'})
    table_body = table.find('tbody')

    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # Get rid of empty values

This gives you:

    [ [u'1359711259', u'SRF', u'08/05/2013', u'5310 4 AVE', u'K', u'19', …

can we use XPath with BeautifulSoup?

Nope, BeautifulSoup, by itself, does not support XPath expressions. An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it'll try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster. …
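As a rough illustration of the lxml route described above, here is a minimal sketch that runs XPath queries against deliberately broken HTML with lxml's default HTML parser. The markup and variable names are made up for the example, and it assumes lxml is installed.

```python
from lxml import html

# Hypothetical, deliberately broken markup: neither <p> is ever closed.
broken = "<div><p class='intro'>Hello<p>World <a href='/next'>next</a></div>"

# lxml's default HTML parser repairs the markup while building the tree.
tree = html.fromstring(broken)

# XPath 1.0 queries: text nodes directly under any <p>, and every href on an <a>.
paragraphs = [text.strip() for text in tree.xpath("//p/text()")]
links = tree.xpath("//a/@href")

print(paragraphs)
print(links)
```

Note that xpath() returns plain Python lists (of strings here), so the results can be handled without any BeautifulSoup objects.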

How to remove \xa0 from string in Python?

\xa0 is the non-breaking space in Latin-1 (ISO 8859-1), i.e. chr(160). You should replace it with a space:

    string = string.replace(u'\xa0', u' ')

Calling .encode('utf-8') encodes the Unicode string as UTF-8, in which every code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0. Read …
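To make the byte-level claim concrete, here is a small Python 3 sketch using only the standard library. The sample string is invented, and the unicodedata.normalize variant is an alternative not mentioned in the answer: NFKC folds the non-breaking space into a regular space, but it also rewrites other compatibility characters, so it is a broader change than a plain replace.

```python
import unicodedata

# Hypothetical scraped text containing non-breaking spaces (U+00A0).
raw = "Price:\xa0100\xa0USD"

# The replacement suggested above: swap each non-breaking space for a plain space.
cleaned = raw.replace("\xa0", " ")

# In UTF-8, U+00A0 is encoded as the two bytes 0xC2 0xA0.
encoded = "\xa0".encode("utf-8")

# Alternative: NFKC normalization also maps U+00A0 to U+0020.
normalized = unicodedata.normalize("NFKC", raw)

print(cleaned)
print(encoded)
```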

BeautifulSoup Grab Visible Webpage Text

Try this:

    from bs4 import BeautifulSoup
    from bs4.element import Comment
    import urllib.request

    def tag_visible(element):
        # Skip text that lives inside non-rendered tags
        if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
            return False
        # Skip HTML comments
        if isinstance(element, Comment):
            return False
        return True

    def text_from_html(body):
        soup = BeautifulSoup(body, 'html.parser')
        texts = soup.find_all(string=True)  # every text node in the document
        visible_texts = filter(tag_visible, texts)
        return u" ".join(t.strip() for t in visible_texts)

    html = urllib.request.urlopen('http://www.nytimes.com/2009/12/21/us/21storm.html').read()
    …

Beautiful Soup: ‘ResultSet’ object has no attribute ‘find_all’?

The table variable contains a list. You would need to call find_all on its members (even though you know it's a list with only one member), not on the entire thing.

    >>> type(table)
    <class 'bs4.element.ResultSet'>
    >>> type(table[0])
    <class 'bs4.element.Tag'>
    >>> len(table[0].find_all('tr'))
    6
    >>>
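The difference between the list-like ResultSet and a single Tag can be reproduced with a small self-contained sketch. The HTML below is invented for illustration, and the example assumes bs4 is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a single two-row table.
html = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# find_all() always returns a ResultSet (a list subclass), even for one match.
tables = soup.find_all("table")

# Calling find_all on the ResultSet itself raises AttributeError.
try:
    tables.find_all("tr")
    raised = False
except AttributeError:
    raised = True

# Fix 1: index into the ResultSet to get a Tag, then search inside it.
rows_via_index = tables[0].find_all("tr")

# Fix 2: use find(), which returns the first matching Tag directly.
rows_via_find = soup.find("table").find_all("tr")

print(raised, len(rows_via_index), len(rows_via_find))
```

If several tables can match, looping over the ResultSet and calling find_all on each Tag is the safer pattern than indexing with [0].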

How to find elements by class

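You can refine your search to only find those divs with a given class by passing an attribute dictionary. Note that find_all is the BeautifulSoup 4 name; BeautifulSoup 3 spelled the method findAll:

    mydivs = soup.find_all("div", {"class": "stylelistrow"})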
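For comparison, here is a minimal sketch of the same search written with the class_ keyword argument that BeautifulSoup 4 also accepts. The HTML is made up for the example; it also shows that a div carrying several classes still matches when any one of them equals the search value.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs carry the class we want, one does not.
html = """
<div class="stylelistrow">first</div>
<div class="stylelistrow odd">second</div>
<div class="other">third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute-dictionary form, as in the answer above.
by_dict = soup.find_all("div", {"class": "stylelistrow"})

# Keyword form; the trailing underscore avoids Python's reserved word "class".
by_keyword = soup.find_all("div", class_="stylelistrow")

texts_dict = [div.get_text() for div in by_dict]
texts_keyword = [div.get_text() for div in by_keyword]
print(texts_keyword)
```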

UnicodeEncodeError: ‘charmap’ codec can’t encode characters

I was getting the same UnicodeEncodeError when saving scraped web content to a file. To fix it I replaced this code:

    with open(fname, "w") as f:
        f.write(html)

with this:

    with open(fname, "w", encoding="utf-8") as f:
        f.write(html)

If you need to support Python 2, then use this:

    import io
    with io.open(fname, "w", encoding="utf-8") as f:
        f.write(html)

…
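The failure and the fix can be reproduced without touching the platform default encoding. The sketch below is an illustration under assumptions: the sample string and file name are invented, and cp1252 stands in for a typical Windows "charmap" default, which may differ on your system.

```python
import os
import tempfile

# Hypothetical scraped content with characters that cp1252 cannot represent.
html = "<p>caf\u00e9 \u2192 \u4f60\u597d</p>"

# Encoding with a legacy code page fails, as in the 'charmap' error.
try:
    html.encode("cp1252")
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

# Writing with an explicit UTF-8 encoding succeeds and round-trips cleanly.
with tempfile.TemporaryDirectory() as tmp:
    fname = os.path.join(tmp, "page.html")
    with open(fname, "w", encoding="utf-8") as f:
        f.write(html)
    with open(fname, "r", encoding="utf-8") as f:
        round_trip = f.read()

print(legacy_ok, round_trip == html)
```

Passing encoding explicitly on both the write and the later read keeps the behavior the same regardless of the machine's locale settings.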

Extracting an attribute value with beautifulsoup

.find_all() returns a list of all found elements, so:

    input_tag = soup.find_all(attrs={"name": "stainfo"})

input_tag is a list (probably containing only one element). Depending on what exactly you want, you should either do:

    output = input_tag[0]['value']

or use the .find() method, which returns only the first found element:

    input_tag = soup.find(attrs={"name": "stainfo"})
    output = input_tag['value']

retrieve links from web page using python and BeautifulSoup [closed]

Here's a short snippet using the SoupStrainer class in BeautifulSoup:

    import httplib2
    from bs4 import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class …