How to write a crawler?

You’ll be reinventing the wheel, to be sure. But here are the basics:

- A list of unvisited URLs – seed this with one or more starting pages
- A list of visited URLs – so you don’t go around in circles
- A set of rules for URLs you’re not interested in – so you don’t index the … Read more
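As a rough illustration of that structure – a frontier of unvisited URLs, a visited set, and a filter for URLs you don’t want – here is a minimal Python sketch. The seed URL, page limit, and skip rules are illustrative assumptions, not part of the original answer.

```python
# Minimal crawler sketch: frontier + visited set + URL filter.
# Seed, limits and filter rules below are hypothetical examples.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def wanted(url):
    # Rules for URLs you're not interested in -- illustrative only.
    return (urlparse(url).scheme in ("http", "https")
            and not url.endswith((".jpg", ".png", ".pdf")))


def crawl(seed, max_pages=50):
    unvisited = deque([seed])   # frontier, seeded with a starting page
    visited = set()             # so we don't go around in circles

    while unvisited and len(visited) < max_pages:
        url = unvisited.popleft()
        if url in visited or not wanted(url):
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue            # unreachable or non-HTML page: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in visited:
                unvisited.append(absolute)
    return visited
```

A real crawler would also respect robots.txt, rate-limit itself per host, and normalize URLs before comparing them, but the frontier/visited/filter skeleton stays the same.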

Detecting ‘stealth’ web-crawlers

A while back, I worked with a smallish hosting company to help them implement a solution to this. The system I developed examined web server logs for excessive activity from any given IP address and issued firewall rules to block offenders. It included whitelists of IP addresses/ranges based on http://www.iplists.com/, which were then updated automatically … Read more
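For flavor, here is a rough Python sketch of that log-scan-and-block loop. It assumes a common/combined-format access log and iptables for the firewall rules; the log path, request threshold, whitelist entries, and blocking command are assumptions for illustration, not details of the original system.

```python
# Sketch: count requests per IP in an access log, block heavy hitters
# that are not on a crawler whitelist. All constants are hypothetical.
import re
import subprocess
from collections import Counter
from ipaddress import ip_address, ip_network

LOG_PATH = "/var/log/nginx/access.log"   # hypothetical log location
REQUEST_THRESHOLD = 1000                 # hypothetical limit per scan
WHITELIST = [                            # would be built from iplists.com data
    ip_network("66.249.64.0/19"),        # example: a published Googlebot range
]

ip_re = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3}) ")


def whitelisted(addr):
    ip = ip_address(addr)
    return any(ip in net for net in WHITELIST)


def offenders(log_path=LOG_PATH, threshold=REQUEST_THRESHOLD):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = ip_re.match(line)
            if match:
                counts[match.group(1)] += 1
    return [addr for addr, hits in counts.items()
            if hits > threshold and not whitelisted(addr)]


def block(addr):
    # One way to issue a firewall rule; the original system may have
    # used a different mechanism or a temporary ban with expiry.
    subprocess.run(["iptables", "-A", "INPUT", "-s", addr, "-j", "DROP"], check=True)


if __name__ == "__main__":
    for addr in offenders():
        block(addr)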

Do Google’s crawlers interpret Javascript? What if I load a page through AJAX? [closed]

Despite the answers above, Google’s crawler apparently does interpret JavaScript to an extent, according to Matt Cutts: “For a while, we were scanning within JavaScript, and we were looking for links. Google has gotten smarter about JavaScript and can execute some JavaScript. I wouldn’t say that we execute all JavaScript, so there are some conditions in … Read more