Selenium wait for Ajax content to load – universal approach

You need to wait for JavaScript and jQuery to finish loading. Execute JavaScript to check whether jQuery.active is 0 and document.readyState is complete, which means the JS and jQuery load is complete.

    public boolean waitForJSandJQueryToLoad() {
        WebDriverWait wait = new WebDriverWait(driver, 30);

        // wait for jQuery to load
        ExpectedCondition<Boolean> jQueryLoad = new ExpectedCondition<Boolean>() {
            @Override
            … Read more
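The two checks described in this excerpt can be sketched as a small polling loop. This is a minimal Python sketch, not the post's Java method: the function name, the `execute_script` parameter, and the `window.jQuery` guard for pages without jQuery are assumptions added for illustration. With Selenium's Python bindings, `driver.execute_script` could be passed in as the callable.

```python
import time
from typing import Callable

# Script mirroring the two checks from the excerpt. The `window.jQuery` guard
# is an added assumption so that pages without jQuery do not raise an error.
READY_SCRIPT = (
    "return document.readyState === 'complete' && "
    "(typeof window.jQuery === 'undefined' || window.jQuery.active === 0);"
)


def wait_for_js_and_jquery(
    execute_script: Callable[[str], object],
    timeout: float = 30.0,
    poll_interval: float = 0.5,
) -> bool:
    """Poll until the page reports ready; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # The script returns a boolean, so compare against True explicitly.
        if execute_script(READY_SCRIPT) is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

Taking a callable rather than a driver object keeps the loop independent of any particular browser automation library, which also makes it easy to exercise with a stub.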

Click a Button in Scrapy

Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, use Selenium instead. If you are using Scrapy, the solution depends on what the button does. If it just shows content that was previously hidden, you can scrape the data without a problem; it doesn’t … Read more
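The hidden-content case can be illustrated with the standard library alone: content hidden by CSS is still present in the HTML the server sends, so no click is needed. The sample markup, the `details` id, and the helper names below are hypothetical; in a Scrapy spider the same idea would more likely be written with a selector such as `response.css("#details ::text").getall()`.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

# Hypothetical page: the button only toggles visibility of #details.
SAMPLE_HTML = """
<button id="show">Show details</button>
<div id="details" style="display:none"><p>Price: 42 EUR</p><p>In stock</p></div>
"""


class HiddenTextExtractor(HTMLParser):
    """Collect text inside the element with a given id, visible or not."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, element_id):
    """Return the text fragments found inside the element with `element_id`."""
    parser = HiddenTextExtractor(element_id)
    parser.feed(html)
    return parser.chunks
```

Calling `extract_text(SAMPLE_HTML, "details")` returns the text of the hidden block even though a browser would not display it until the button is clicked.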

Detecting ‘stealth’ web-crawlers

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically … Read more
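A rough sketch of the approach described in this excerpt: count requests per IP address in the access log, skip whitelisted ranges, and emit a block rule for each offender. The log format (client IP as the first field), the threshold, the whitelist range, and the use of `iptables` are all assumptions; the excerpt does not say which firewall or limits the original system used.

```python
import ipaddress
import re
from collections import Counter

# Assumes the client IP is the first whitespace-delimited field of each line,
# as in the common/combined access log formats.
LOG_IP = re.compile(r"^(\S+) ")


def find_offenders(log_lines, whitelist, threshold):
    """Return sorted IPs with more than `threshold` requests, minus whitelisted ones."""
    networks = [ipaddress.ip_network(net) for net in whitelist]
    counts = Counter()
    for line in log_lines:
        match = LOG_IP.match(line)
        if not match:
            continue
        try:
            ip = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # first field was a hostname or garbage, not an IP
        counts[ip] += 1
    return sorted(
        str(ip)
        for ip, hits in counts.items()
        if hits > threshold and not any(ip in net for net in networks)
    )


def firewall_rules(offenders):
    """Render one iptables DROP rule per offender (IPv4 only; an assumption)."""
    return [f"iptables -A INPUT -s {ip} -j DROP" for ip in offenders]
```

A production system would also need a time window for the counts and a way to expire old rules; this sketch only shows the counting and whitelisting steps.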

Do Google’s crawlers interpret JavaScript? What if I load a page through AJAX? [closed]

Despite the answers above, apparently it does interpret JavaScript, to an extent, according to Matt Cutts: “For a while, we were scanning within JavaScript, and we were looking for links. Google has gotten smarter about JavaScript and can execute some JavaScript. I wouldn’t say that we execute all JavaScript, so there are some conditions in … Read more

Pulling data from a webpage, parsing it for specific pieces, and displaying it

This small example uses HtmlAgilityPack with XPath selectors to reach the desired elements.

    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
        var web = new HtmlAgilityPack.HtmlWeb();
        HtmlDocument doc = web.Load(url);

        // SelectNodes returns null when nothing matches, so each line below
        // assumes the page layout still matches the XPath.
        string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
        string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
        string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
    }

An easy way to obtain the XPath … Read more
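To show how a positional XPath like the ones above selects a node, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It is an illustration only: ElementTree supports just a subset of XPath and requires well-formed markup, and the sample document, its values, and the class names are made up rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a scraped page.
SAMPLE = """
<html><body>
<div id="main">
  <div>banner</div>
  <div>
    <a><span class="metascore">81</span></a>
    <a><span class="userscore">7.4</span></a>
  </div>
</div>
</body></html>
""".strip()


def first_text(root, path):
    """Return the text of the first match, or None when nothing matches."""
    node = root.find(path)
    return node.text if node is not None else None


root = ET.fromstring(SAMPLE)

# Positional path: second div under #main, first link, first span.
positional = first_text(root, ".//*[@id='main']/div[2]/a[1]/span[1]")

# Attribute-based path: less sensitive to layout changes than counting divs.
by_class = first_text(root, ".//span[@class='userscore']")
```

Returning `None` for a missing node avoids the failure mode of indexing straight into an empty result, which is what happens when a long positional path stops matching after a page redesign.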