How to scrape tables inside a comment tag in HTML with R?

You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:

```r
library(rvest)

# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')

df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%                                  # extract comment text
    paste(collapse = '') %>%                         # collapse to a single string
    read_html() %>%                                  # reparse to HTML
    html_node('table#advanced')
```
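The snippet above stops at selecting the table node. As a hedged follow-up (a minimal sketch, assuming the page still hides a table with id "advanced" inside a comment), html_table() converts that node into a data frame:

```r
library(rvest)

h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')

advanced <- h %>%
  html_nodes(xpath = '//comment()') %>%  # select all comment nodes
  html_text() %>%                        # extract their raw text
  paste(collapse = '') %>%               # stitch into one HTML string
  read_html() %>%                        # parse the hidden markup
  html_node('table#advanced') %>%        # grab the advanced-stats table
  html_table()                           # convert to a data frame

head(advanced)
```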

rvest Error in open.connection(x, "rb") : Timeout was reached

I encountered the same Error in open.connection(x, "rb") : Timeout was reached issue when working behind a proxy on the office network. Here's what worked for me:

```r
library(rvest)

url <- "http://google.com"
# download the page to disk first, then parse the local copy
download.file(url, destfile = "scrapedpage.html", quiet = TRUE)
content <- read_html("scrapedpage.html")
```

Credit: https://stackoverflow.com/a/38463559
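If writing to disk is not an option, another approach (a hedged sketch, not part of the original answer) is to fetch the page with httr and raise the request timeout explicitly:

```r
library(httr)
library(rvest)

url <- "http://google.com"

# allow up to 60 seconds before giving up on the connection
resp <- GET(url, timeout(60))

# if behind a proxy, httr can also route through it (hypothetical host/port):
# resp <- GET(url, timeout(60), use_proxy("proxy.example.com", port = 8080))

# parse the response body as HTML
page <- read_html(content(resp, as = "text"))
```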

R web scraping across multiple pages

You can do something similar with purrr::map_df() as well if you want all the info as a data.frame:

```r
library(rvest)
library(purrr)

# note: the space in the search term is URL-encoded as %20
url_base <- "http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=%d"

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine = html_text(html_nodes(pg, ".review-listing .title")),
             excerpt = html_text(html_nodes(pg, "div.excerpt")),
             rating = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation = html_text(html_nodes(pg, "span.appellation")),
             price = gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors = FALSE)

})
```
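A hedged variant (not from the original answer): if any single page fails to load, map_df() aborts the whole run. Wrapping the per-page scraper in purrr::possibly() lets failed pages drop out quietly, and a short Sys.sleep() keeps the crawl polite:

```r
library(rvest)
library(purrr)

url_base <- "http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=%d"

# return NULL for pages that error instead of stopping the loop
scrape_page <- possibly(function(i) {
  Sys.sleep(1)  # pause between requests to avoid hammering the server
  pg <- read_html(sprintf(url_base, i))
  data.frame(wine = html_text(html_nodes(pg, ".review-listing .title")),
             rating = gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             stringsAsFactors = FALSE)
}, otherwise = NULL)

wines <- map_df(1:39, scrape_page)  # NULL results are skipped when rows are bound
```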