Scrapy CrawlSpider doesn’t crawl the first landing page

Just change your callback to parse_start_url and override it:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class DownloadSpider(CrawlSpider):
        name = "downloader"
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ["http://www.bnt-chemicals.de"]
        rules = (
            Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )
        fname = 0

        def parse_start_url(self, response):
            self.fname += 1
            fname = "%s.txt" % self.fname
            with open(fname, 'w') as f:
                f.write('%s, %s\n' …

Read more
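A hedged aside: the scrapy.contrib paths above come from old Scrapy releases. On current versions, where scrapy.contrib and SgmlLinkExtractor were removed, the same spider would start roughly like this (the parse_start_url override is unchanged):

    # Modern import paths; the rest of the spider stays as above.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DownloadSpider(CrawlSpider):
        name = "downloader"
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ["http://www.bnt-chemicals.de"]
        rules = (
            Rule(LinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )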

Creating a generic scrapy spider

You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so (a Python 2 interpreter session; the compiler module does not exist in Python 3):

    >>> a = open("test.py")
    >>> from compiler import compile
    >>> d = compile(a.read(), 'spider.py', 'exec')
    >>> eval(d)
    >>> MySpider
    <class '__main__.MySpider'>
    >>> print MySpider.start_urls
    ['http://www.somedomain.com']
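As a hedged sketch of the same idea on Python 3, the built-in compile() and exec() stand in for the removed compiler module (assuming, as above, that test.py defines a MySpider class):

    # Rough Python 3 equivalent of the session above.
    with open("test.py") as f:
        source = f.read()

    namespace = {}
    code = compile(source, "spider.py", "exec")
    exec(code, namespace)                 # executes the class definition
    MySpider = namespace["MySpider"]      # pull the spider class back out
    print(MySpider.start_urls)            # e.g. ['http://www.somedomain.com']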

Scrapy: ImportError: No module named items

From this message on google groups: Your spider module is named the same as your scrapy project module, so Python is trying to import items relative to the byub.py spider. You are hitting a common pitfall of Python imports; see http://www.python.org/dev/peps/pep-0328. Quick fixes: rename your spider module to byub_org.py or similar, or use from __future__ import … Read more
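To make the clash concrete, here is a hypothetical layout of the kind described (the byub names come from the question; ByubItem is an invented item class for illustration):

    # Hypothetical project layout where the spider module shadows the project
    # package (Python 2's implicit relative imports make the clash bite):
    #
    #   byub/
    #       __init__.py
    #       items.py            <- the items module the spider wants
    #       spiders/
    #           __init__.py
    #           byub.py         <- spider module named like the project package
    #
    # After renaming the spider module as suggested above:
    # byub/spiders/byub_org.py
    from byub.items import ByubItem   # ByubItem is a hypothetical item class;
                                      # the import now resolves to byub/items.py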

scrapy - how to stop Redirect (302)

Yes, you can do this simply by adding meta values like meta={'dont_redirect': True}. You can also stop redirects for a particular response code, like meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}; this stops redirection only for 302 response codes, and you can add as many HTTP status codes as you want to keep from being redirected. Example: yield Request('some url', meta = … Read more
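Spelling out a complete request of that shape, a minimal sketch might look like this (the URL, callback name, and surrounding method are placeholders, not from the original answer):

    # Minimal sketch of a non-redirecting request inside a spider.
    from scrapy.http import Request

    def start_requests(self):
        yield Request(
            'http://www.example.com/some-page',            # placeholder URL
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.parse_page,   # receives the raw 302 response itself
        )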

Force my scrapy spider to stop crawling

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. The 0.14 release notes mention it: "Added CloseSpider exception to manually close spiders (r2691)". Example as per the docs:

    from scrapy.exceptions import CloseSpider

    def parse_page(self, response):
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
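One caveat worth hedging: on current Scrapy under Python 3, response.body is bytes, so the membership test needs a bytes literal (or use response.text instead):

    # Same check on a current Scrapy release, where response.body is bytes.
    from scrapy.exceptions import CloseSpider

    def parse_page(self, response):
        if b'Bandwidth exceeded' in response.body:    # or: in response.text
            raise CloseSpider('bandwidth_exceeded')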