How to pass a user-defined argument to a Scrapy spider

Spider arguments are passed in the crawl command using the -a option. For example:

    scrapy crawl myspider -a category=electronics -a domain=system

Spiders can access arguments as attributes:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'

        def __init__(self, category='', **kwargs):
            self.start_urls = [f'http://www.example.com/{category}']  # py36
            super().__init__(**kwargs)  # python3

        def parse(self, response):
            self.log(self.domain)  # system

Taken from the Scrapy docs: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
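Under the hood, the -a pairs are just keyword arguments forwarded to the spider's __init__, and the base class stores any leftovers as attributes (which is why self.domain works in parse() even though __init__ never mentions it). A minimal, Scrapy-free sketch of that mechanism; SpiderBase and parse_cli_args are hypothetical stand-ins for illustration, not Scrapy API:

```python
# Sketch of how "-a key=value" options become spider attributes.
# SpiderBase stands in for scrapy.Spider's attribute-setting behaviour.

class SpiderBase:
    def __init__(self, **kwargs):
        # Any keyword argument not consumed by a subclass becomes an attribute.
        for key, value in kwargs.items():
            setattr(self, key, value)

class MySpider(SpiderBase):
    def __init__(self, category='', **kwargs):
        super().__init__(**kwargs)
        self.start_urls = [f'http://www.example.com/{category}']

def parse_cli_args(pairs):
    """Turn ['category=electronics', 'domain=system'] into a dict."""
    return dict(pair.split('=', 1) for pair in pairs)

args = parse_cli_args(['category=electronics', 'domain=system'])
spider = MySpider(**args)
print(spider.start_urls)  # ['http://www.example.com/electronics']
print(spider.domain)      # system
```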

Scrapy – Reactor not Restartable [duplicate]

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process:

    import scrapy
    import scrapy.crawler as crawler
    from scrapy.utils.log import configure_logging
    from multiprocessing import Process, Queue
    from twisted.internet import reactor

    # your spider
    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/tag/humor/']

        def parse(self, response):
            for quote …

How to run Scrapy from within a Python script

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

How do I make a simple crawler in PHP? [closed]

Meh. Don't parse HTML with regexes. Here's a DOM version inspired by Tatu's:

    <?php
    function crawl_page($url, $depth = 5)
    {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
        $seen[$url] = true;

        $dom = new DOMDocument('1.0');
        @$dom->loadHTMLFile($url);

        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== …
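For comparison, the same depth-limited, parser-based approach can be sketched in Python using only the standard library. The PAGES dict below is a hypothetical in-memory site standing in for real HTTP fetches, so the sketch runs without network access:

```python
# Depth-limited crawler that extracts links with a real HTML parser
# instead of regexes, mirroring the PHP crawl_page() above.

from html.parser import HTMLParser
from urllib.parse import urljoin

PAGES = {  # hypothetical site used for illustration
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/b">B again</a>',
    'http://example.com/b': '<a href="/">home</a>',
}

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect href attributes from <a> tags only.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

def crawl_page(url, depth=5, seen=None):
    """Visit url, then recurse into its links, like the PHP version."""
    seen = seen if seen is not None else set()
    if url in seen or depth == 0 or url not in PAGES:
        return seen
    seen.add(url)
    parser = LinkExtractor()
    parser.feed(PAGES[url])
    for href in parser.links:
        crawl_page(urljoin(url, href), depth - 1, seen)
    return seen

visited = crawl_page('http://example.com/')
```

In real use, replace the PAGES lookup with a urllib.request fetch; the static $seen set in the PHP version corresponds to the seen set threaded through the recursion here.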

How can I scrape pages with dynamic content using node.js?

Here you go:

    var phantom = require('phantom');

    phantom.create(function (ph) {
        ph.createPage(function (page) {
            var url = 'http://www.bdtong.co.kr/index.php?c_category=C02';
            page.open(url, function () {
                page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js', function () {
                    page.evaluate(function () {
                        $('.listMain > li').each(function () {
                            console.log($(this).find('a').attr('href'));
                        });
                    }, function () {
                        ph.exit();
                    });
                });
            });
        });
    });

how to detect search engine bots with php?

I use the following code, which seems to work fine:

    function _bot_detected() {
        return (
            isset($_SERVER['HTTP_USER_AGENT'])
            && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
        );
    }

Update 16-06-2017: added mediapartners, per https://support.google.com/webmasters/answer/1061943?hl=en
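The same check ports directly to other languages. A Python version using the identical regex, assuming you pass in the User-Agent header yourself (e.g. from a WSGI request) instead of reading $_SERVER:

```python
# Python port of the PHP _bot_detected() check above.

import re

BOT_PATTERN = re.compile(r'bot|crawl|slurp|spider|mediapartners', re.IGNORECASE)

def bot_detected(user_agent):
    """Return True if the user-agent string looks like a known crawler."""
    # A missing header (None) counts as not-a-bot, like isset() in PHP.
    return bool(user_agent and BOT_PATTERN.search(user_agent))

print(bot_detected('Mozilla/5.0 (compatible; Googlebot/2.1)'))    # True
print(bot_detected('Mozilla/5.0 (Windows NT 10.0) Firefox/115'))  # False
print(bot_detected(None))                                         # False
```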

Python Google Web Crawler

For Stack Overflow you can use the API directly. For example:

https://api.stackexchange.com/2.1/questions?fromdate=1381190400&todate=1381276800&order=desc&sort=activity&tagged=python&site=stackoverflow

See https://api.stackexchange.com/docs/questions#fromdate=2013-10-08&todate=2013-10-09&order=desc&sort=activity&tagged=python&filter=default&site=stackoverflow

You can't make more than 30 requests a second; see http://api.stackexchange.com/docs/throttle
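To stay under that limit, a client can enforce a minimum gap between calls. A minimal client-side throttle sketch; note a production client should also honour the backoff field the Stack Exchange API returns, which this ignores:

```python
# Keep at least 1/30 s between calls by sleeping before each request.

import time

class Throttle:
    def __init__(self, max_per_second=30):
        self.min_interval = 1.0 / max_per_second
        self.last_call = None

    def wait(self):
        # Sleep just long enough to preserve the minimum interval.
        now = time.monotonic()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                time.sleep(remaining)
        self.last_call = time.monotonic()

throttle = Throttle(max_per_second=30)
start = time.monotonic()
for _ in range(5):
    throttle.wait()  # a real client would issue the API request here
elapsed = time.monotonic() - start
```

Five calls leave four enforced gaps, so the loop takes at least about 4/30 of a second regardless of how fast the requests themselves return.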