Scraping a webpage generated by JavaScript with C#

The problem is that a browser normally executes the JavaScript, which produces an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. I ran into the same issue in the past and used Selenium and PhantomJS to render the page. Once the page was rendered, I used the WebDriver client to navigate the DOM and retrieve the content I needed after the AJAX calls had completed.

At a high level, these are the steps:

  1. Installed Selenium: http://docs.seleniumhq.org/
  2. Started the Selenium hub as a service
  3. Downloaded PhantomJS (a headless browser that can execute the JavaScript): http://phantomjs.org/
  4. Started PhantomJS in WebDriver mode, pointing to the Selenium hub
  5. In my scraping application, installed the WebDriver client NuGet package: Install-Package Selenium.WebDriver

Here is an example of using the PhantomJS WebDriver:

using System;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Remote;

var options = new PhantomJSOptions();
options.AddAdditionalCapability("IsJavaScriptEnabled", true);

// Configuration.SeleniumServerHub is an application setting containing the hub URL
var driver = new RemoteWebDriver(new Uri(Configuration.SeleniumServerHub),
                                 options.ToCapabilities(),
                                 TimeSpan.FromSeconds(3)); // command timeout

// setting Url navigates to the page; a separate Navigate() call is not needed
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
// the driver can now provide you with what you need (it will execute the script)
// get the source of the page
var source = driver.PageSource;
// fully navigate the DOM
var pathElement = driver.FindElementById("some-id");

More information on Selenium, PhantomJS and WebDriver can be found at the following links:

http://docs.seleniumhq.org/

http://docs.seleniumhq.org/projects/webdriver/

http://phantomjs.org/

EDIT: Easier Method

It appears there is a NuGet package for PhantomJS, so you don’t need the hub (I used a cluster to do large-scale scraping in this manner):

Install the WebDriver client:

Install-Package Selenium.WebDriver

Install the embedded PhantomJS executable:

Install-Package phantomjs.exe

Updated code:

using OpenQA.Selenium.PhantomJS;

var driver = new PhantomJSDriver();
// setting Url navigates to the page; a separate Navigate() call is not needed
driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";
// the driver can now provide you with what you need (it will execute the script)
// get the source of the page
var source = driver.PageSource;
// fully navigate the DOM
var pathElement = driver.FindElementById("some-id");
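
One caveat: content loaded by AJAX may not be in the DOM yet when the page load returns, so it can help to wait for the element before reading it. Below is a minimal sketch of an explicit wait. It assumes the Selenium.Support NuGet package is installed alongside Selenium.WebDriver (for WebDriverWait); the 10-second timeout is an arbitrary choice, and "some-id" is the same placeholder id used above.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Support.UI; // assumption: Selenium.Support package is installed

class WaitForAjaxExample
{
    static void Main()
    {
        // disposing the driver also shuts down the PhantomJS process
        using (var driver = new PhantomJSDriver())
        {
            driver.Url = "http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083";

            // poll the DOM until the element exists, for at most 10 seconds (arbitrary)
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

            // "some-id" is a placeholder; replace it with an id the scripts actually render
            IWebElement element = wait.Until(d => d.FindElement(By.Id("some-id")));

            Console.WriteLine(element.Text);
        }
    }
}
```

If the element never appears, Until throws a WebDriverTimeoutException, which you can catch to retry or skip the page.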
