Scrapy CrawlSpider doesn’t crawl the first landing page

Just change your callback to parse_start_url and override it:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class DownloadSpider(CrawlSpider):
        name = "downloader"
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ["http://www.bnt-chemicals.de"]
        rules = (
            Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )
        fname = 0

        def parse_start_url(self, response):
            self.fname += 1
            fname = "%s.txt" % self.fname
            with open(fname, 'w') as f:
                f.write('%s, %s\n' …

Read more
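A hedged aside: the scrapy.contrib paths above come from old Scrapy releases. On current versions, where scrapy.contrib and SgmlLinkExtractor were removed, the same spider would start roughly like this (the parse_start_url override is unchanged):

    # Modern import paths; the rest of the spider stays as above.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DownloadSpider(CrawlSpider):
        name = "downloader"
        allowed_domains = ['bnt-chemicals.de']
        start_urls = ["http://www.bnt-chemicals.de"]
        rules = (
            Rule(LinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
        )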

Creating a generic scrapy spider

You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so (a Python 2 interpreter session; the compiler module does not exist in Python 3):

    >>> a = open("test.py")
    >>> from compiler import compile
    >>> d = compile(a.read(), 'spider.py', 'exec')
    >>> eval(d)
    >>> MySpider
    <class '__main__.MySpider'>
    >>> print MySpider.start_urls
    ['http://www.somedomain.com']
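As a hedged sketch of the same idea on Python 3, the built-in compile() and exec() stand in for the removed compiler module (assuming, as above, that test.py defines a MySpider class):

    # Rough Python 3 equivalent of the session above.
    with open("test.py") as f:
        source = f.read()

    namespace = {}
    code = compile(source, "spider.py", "exec")
    exec(code, namespace)                 # executes the class definition
    MySpider = namespace["MySpider"]      # pull the spider class back out
    print(MySpider.start_urls)            # e.g. ['http://www.somedomain.com']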

Scrapy: ImportError: No module named items

From this message on google groups: Your spider module is named the same as your scrapy project module, so Python is trying to import items relative to the byub.py spider. You are hitting a common pitfall of Python imports; see http://www.python.org/dev/peps/pep-0328. Quick fixes: rename your spider module to byub_org.py or similar, or use from __future__ import … Read more
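To make the clash concrete, here is a hypothetical layout of the kind described (the byub names come from the question; ByubItem is an invented item class for illustration):

    # Hypothetical project layout where the spider module shadows the project
    # package (Python 2's implicit relative imports make the clash bite):
    #
    #   byub/
    #       __init__.py
    #       items.py            <- the items module the spider wants
    #       spiders/
    #           __init__.py
    #           byub.py         <- spider module named like the project package
    #
    # After renaming the spider module as suggested above:
    # byub/spiders/byub_org.py
    from byub.items import ByubItem   # ByubItem is a hypothetical item class;
                                      # the import now resolves to byub/items.py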

scrapy - how to stop Redirect (302)

Yes, you can do this simply by adding meta values like meta={'dont_redirect': True}. You can also stop redirects for a particular response code, like meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}; this stops redirection only for 302 response codes, and you can add as many HTTP status codes as you want to keep from being redirected. Example: yield Request('some url', meta = … Read more
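Spelling out a complete request of that shape, a minimal sketch might look like this (the URL, callback name, and surrounding method are placeholders, not from the original answer):

    # Minimal sketch of a non-redirecting request inside a spider.
    from scrapy.http import Request

    def start_requests(self):
        yield Request(
            'http://www.example.com/some-page',            # placeholder URL
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.parse_page,   # receives the raw 302 response itself
        )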

Force my scrapy spider to stop crawling

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. The 0.14 release notes mention it: "Added CloseSpider exception to manually close spiders (r2691)". Example as per the docs:

    from scrapy.exceptions import CloseSpider

    def parse_page(self, response):
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')

See also: http://readthedocs.org/docs/scrapy/en/latest/topics/exceptions.html?highlight=closeSpider
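One caveat worth hedging: on current Scrapy under Python 3, response.body is bytes, so the membership test needs a bytes literal (or use response.text instead):

    # Same check on a current Scrapy release, where response.body is bytes.
    from scrapy.exceptions import CloseSpider

    def parse_page(self, response):
        if b'Bandwidth exceeded' in response.body:    # or: in response.text
            raise CloseSpider('bandwidth_exceeded')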