web-scraping - w3toppers.com

Can we use XPath with BeautifulSoup?

Nope, BeautifulSoup, by itself, does not support XPath expressions. An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it’ll try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster. … Read more
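To give a feel for what an XPath-style query looks like, here is a minimal sketch. It deliberately uses the standard library's xml.etree.ElementTree rather than lxml, so it runs without extra installs; ElementTree supports only a limited subset of XPath and needs well-formed input, whereas lxml offers full XPath 1.0 through its xpath() method. The sample markup and the function name hrefs_in_content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup. The stdlib parser rejects broken
# HTML, which is one reason the answer above recommends lxml for real pages.
SAMPLE = """
<html>
  <body>
    <div class="nav"><a href="/home">Home</a></div>
    <div class="content">
      <a href="/docs">Docs</a>
      <a>No target</a>
    </div>
  </body>
</html>
""".strip()


def hrefs_in_content(markup):
    """Return the href of every link directly inside <div class="content">."""
    root = ET.fromstring(markup)
    # Limited XPath subset: descendant search (.//), an attribute-value
    # predicate ([@class='content']) and an attribute-presence predicate ([@href]).
    links = root.findall(".//div[@class='content']/a[@href]")
    return [a.get("href") for a in links]


print(hrefs_in_content(SAMPLE))  # → ['/docs']
```

The same path expression should carry over to lxml largely unchanged, but treat that as an assumption to verify against the lxml documentation.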

Is it ok to scrape data from Google results? [closed]

Google disallows automated access in its TOS, so if you accept the terms, scraping would break them. That said, I know of no lawsuit from Google against a scraper. Even Microsoft scraped Google and used the results to power its search engine, Bing; it got caught red-handed in 2011 🙂 There are two options to scrape … Read more

How to use Python requests to fake a browser visit a.k.a and generate User Agent?

Provide a User-Agent header:

    import requests

    url = 'http://www.ichangtou.com/#company:data_000008.html'
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    print(response.content)

FYI, here is a list of User-Agent strings for different browsers: List of all Browsers. As a side note, there is a pretty useful third-party package called … Read more

Scraping html tables into R data frames using the XML package

…or a shorter try:

    library(XML)
    library(RCurl)
    library(rlist)
    theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team", .opts = list(ssl.verifypeer = FALSE))
    tables <- readHTMLTable(theurl)
    tables <- list.clean(tables, fun = is.null, recursive = FALSE)
    n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))

The picked table is the longest one on the page:

    tables[[which.max(n.rows)]]

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after. Its party trick is a CSS selector syntax to find elements, e.g.:

    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links … Read more

selenium with scrapy for dynamic page

It really depends on how you need to scrape the site and what data you want to get. Here’s an example of how you can follow pagination on eBay using Scrapy+Selenium:

    import scrapy
    from selenium import webdriver

    class ProductSpider(scrapy.Spider):
        name = "product_spider"
        allowed_domains = ['ebay.com']
        start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

        def __init__(self):
            self.driver = … Read more

How to find elements by class

You can refine your search to only find those divs with a given class. With BeautifulSoup 4 the method is find_all (BS3 spelled it findAll): mydivs = soup.find_all("div", {"class": "stylelistrow"})
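A class attribute can hold several space-separated names, and BeautifulSoup matches a div if any one of them equals the requested class. The sketch below illustrates that matching rule with the standard library's html.parser instead of BeautifulSoup, so it is an approximation of the behaviour, not the library's implementation. The class name DivClassFinder, the helper find_div_ids and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class DivClassFinder(HTMLParser):
    """Record the attributes of every <div> whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []  # one attribute dict per matching <div>

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr_map = dict(attrs)
        # The class attribute is a space-separated list; compare whole tokens,
        # so "stylelistrowx" does not count as "stylelistrow".
        classes = (attr_map.get("class") or "").split()
        if self.wanted in classes:
            self.matches.append(attr_map)


def find_div_ids(html, wanted):
    """Return the id of each <div> carrying the wanted class."""
    finder = DivClassFinder(wanted)
    finder.feed(html)
    finder.close()
    return [m.get("id") for m in finder.matches]


# Hypothetical sample: two matching divs, one near-miss, one non-div element.
SAMPLE = (
    '<div class="stylelistrow" id="a">one</div>'
    '<div class="stylelistrow odd" id="b">two</div>'
    '<div class="stylelistrowx" id="c">three</div>'
    '<span class="stylelistrow" id="d">four</span>'
)

print(find_div_ids(SAMPLE, "stylelistrow"))  # → ['a', 'b']
```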

How can I pass variable into an evaluate function?

You have to pass the variable as an argument to the pageFunction like this:

    const links = await page.evaluate((evalVar) => {
      console.log(evalVar); // 2. should be defined now
      …
    }, evalVar); // 1. pass variable as an argument

You can pass in multiple variables by passing more arguments to page.evaluate():

    await page.evaluate((a, b, c) => … Read more

Difference between text and innerHTML using Selenium

To start with, text is a property whereas innerHTML is an attribute. Fundamentally, there are some differences between a property and an attribute.

get_attribute("innerHTML")

get_attribute("innerHTML") gets the innerHTML of the element. This method will first try to return the value of a property with the given name. If a property with that name doesn’t … Read more

retrieve links from web page using python and BeautifulSoup [closed]

Here’s a short snippet using the SoupStrainer class in BeautifulSoup:

    import httplib2
    from bs4 import BeautifulSoup, SoupStrainer

    http = httplib2.Http()
    status, response = http.request('http://www.nytimes.com')

    for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            print(link['href'])

The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ Edit: Note that I used the SoupStrainer class … Read more
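For comparison, roughly the same "only look at the a tags" idea can be sketched with the standard library alone. This is not how SoupStrainer works internally; it is a minimal stand-in using html.parser, with a hypothetical LinkCollector class and extract_links helper, and it takes an HTML string instead of fetching a page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, ignoring all other tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip anchors without an href
                self.links.append(href)


def extract_links(html):
    """Return all href values found in the given HTML string, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical sample markup, including an anchor that has no href.
SAMPLE = '<p><a href="/a">A</a><a name="x">B</a><A HREF="http://example.com/b">C</A></p>'

print(extract_links(SAMPLE))  # → ['/a', 'http://example.com/b']
```

BeautifulSoup remains the more forgiving choice for messy real-world pages; the stdlib version is mainly handy when adding a dependency is not an option.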