web-crawler - w3toppers.com

Selenium wait for Ajax content to load – universal approach

You need to wait for JavaScript and jQuery to finish loading. Execute JavaScript to check that jQuery.active is 0 and document.readyState is complete, which means both the page scripts and jQuery have finished loading. public boolean waitForJSandJQueryToLoad() { WebDriverWait wait = new WebDriverWait(driver, 30); // wait for jQuery to load ExpectedCondition<Boolean> jQueryLoad = new ExpectedCondition<Boolean>() { @Override … Read more
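
The two checks described above can be sketched in Python as a plain polling loop. This is only an illustration, not the excerpt's Java implementation: it replaces WebDriverWait with a hand-rolled loop, the function name and the `poll` parameter are hypothetical, and treating a page without jQuery as "idle" is an added assumption. The only driver call it relies on is `execute_script`, which Selenium's Python WebDriver provides.

```python
import time

# JavaScript snippets for the two conditions named in the excerpt.
# Assumption: a page that never loaded jQuery counts as "jQuery idle".
JQUERY_IDLE = "return (typeof jQuery === 'undefined') || jQuery.active === 0;"
DOM_READY = "return document.readyState === 'complete';"


def wait_for_js_and_jquery(driver, timeout=30.0, poll=0.5):
    """Poll until both checks pass; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # Both scripts must report true in the same polling round.
        if driver.execute_script(JQUERY_IDLE) and driver.execute_script(DOM_READY):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

With a real Selenium driver this would be called as `wait_for_js_and_jquery(driver)` right after `driver.get(url)`, before reading the Ajax-loaded elements.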

Spider a Website and Return URLs Only

The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately. wget --spider --force-html -r -l2 $url 2>&1 \ | grep … Read more
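
As a rough illustration of the "filter the output" step, the sketch below pulls URLs out of captured Wget log text in Python instead of grep. It is not the excerpt's (truncated) grep pipeline. It assumes Wget's usual request lines of the form `--YYYY-MM-DD HH:MM:SS--  URL`, and the function name is hypothetical.

```python
import re

# Assumed shape of a Wget request line: "--2013-05-01 10:00:00--  http://host/path"
REQUEST_LINE = re.compile(
    r"^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)", re.MULTILINE
)


def urls_from_wget_log(log_text):
    """Return each requested URL once, in the order Wget first visited it."""
    seen = set()
    urls = []
    for url in REQUEST_LINE.findall(log_text):
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The log text could come from running the Wget command with `subprocess.run(..., stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)`, which mirrors the `2>&1` redirection in the shell version.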

Click a Button in Scrapy

Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, use Selenium. If using Scrapy, the solution depends on what the button is doing. If it’s just showing content that was previously hidden, you can scrape the data without a problem; it doesn’t … Read more
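
To illustrate the "previously hidden" case: content that a button merely reveals is already present in the downloaded markup, so no click is needed to read it. The sketch below uses Python's standard `html.parser` as a stand-in and is not Scrapy's selector API. The class name is hypothetical, and it assumes the content is hidden with an inline `display:none` style.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HiddenTextCollector(HTMLParser):
    """Collect text found inside elements styled with display:none."""

    def __init__(self):
        super().__init__()
        self.depth = 0  # greater than 0 while inside a hidden element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.depth or "display:none" in style:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())
```

Feeding the raw page source to `HiddenTextCollector().feed(html)` and reading `.chunks` shows the hidden text without running any script.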

Detecting ‘stealth’ web-crawlers

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically … Read more
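
The log-analysis idea can be sketched roughly as follows. This is not the author's system: the function name, the threshold, and the assumption that each log line starts with the client IP (as in common access-log formats) are all illustrative, and the step that issues firewall rules is left out.

```python
from collections import Counter
from ipaddress import ip_address, ip_network


def find_offenders(log_lines, threshold, whitelist=()):
    """Return IPs with more than `threshold` requests that are not whitelisted."""
    networks = [ip_network(net) for net in whitelist]
    counts = Counter()
    for line in log_lines:
        parts = line.split(maxsplit=1)
        if not parts:
            continue
        try:
            addr = ip_address(parts[0])  # assumed: first field is the client IP
        except ValueError:
            continue  # skip lines that do not start with an IP address
        counts[addr] += 1
    return sorted(
        str(addr)
        for addr, hits in counts.items()
        if hits > threshold and not any(addr in net for net in networks)
    )
```

The returned addresses would then be handed to whatever mechanism creates the blocking rules. The whitelist takes CIDR ranges, so known crawler ranges can be exempted in one entry.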

Fetch contents (loaded through AJAX call) of a web page

Jsoup is an HTML parser only. Unfortunately, it’s not possible to parse any JavaScript / Ajax content, since jsoup can’t execute those. The solution: use a library that can handle scripts. Here are some examples I know: HtmlUnit, Java Script Engine, Apache Commons BSF, Rhino. If such a library doesn’t support parsing or selectors, you … Read more

How can I handle Javascript in a Perl web crawler?

Another option might be Selenium with the WWW::Selenium module.

Do Google’s crawlers interpret Javascript? What if I load a page through AJAX? [closed]

Despite the answers above, apparently it does interpret JavaScript, to an extent, according to Matt Cutts: “For a while, we were scanning within JavaScript, and we were looking for links. Google has gotten smarter about JavaScript and can execute some JavaScript. I wouldn’t say that we execute all JavaScript, so there are some conditions in … Read more

Parse HTML content in VBA

Just a couple things that hopefully will get you in the right direction: clean up a bit: remove the readystate property testing loop. The value returned by the readystate property will never change in this context – code will pause after the send instruction, to resume only once the server response is received, or has … Read more

Pulling data from a webpage, parsing it for specific pieces, and displaying it

This small example uses HtmlAgilityPack with XPath selectors to get to the desired elements. protected void Page_Load(object sender, EventArgs e) { string url = "http://www.metacritic.com/game/pc/halo-spartan-assault"; var web = new HtmlAgilityPack.HtmlWeb(); HtmlDocument doc = web.Load(url); string metascore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText; string userscore = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText; string summary = doc.DocumentNode.SelectNodes("//*[@id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText; } An easy way to obtain the XPath … Read more

Anyone know of a good Python based web crawler that I could use?

Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission). Twill is a simple scripting language built on top of Mechanize. BeautifulSoup + urllib2 also works quite nicely. Scrapy looks like an extremely promising project; it’s new.