Crawling with an authenticated session in Scrapy

Do not override the parse function in a CrawlSpider: When you are using a CrawlSpider, you shouldn’t override the parse function. There’s a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules. Logging in before … Read more
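A minimal sketch of that pattern, assuming a placeholder login form and item pages; the spider name, credentials and URL patterns are all illustrative, and the point is that the link-following rules use their own callback while parse stays untouched:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class AuthenticatedCrawlSpider(CrawlSpider):
    name = "auth_example"                       # hypothetical spider name
    login_url = "http://www.example.com/login"  # placeholder login page

    rules = (
        # The rule callback is parse_item, not parse, so CrawlSpider can
        # keep using parse internally to apply these rules.
        Rule(LinkExtractor(allow=r"/items/"), callback="parse_item", follow=True),
    )

    def start_requests(self):
        # Log in first; start crawling only once the session cookie is set.
        yield scrapy.FormRequest(
            self.login_url,
            formdata={"user": "me", "pass": "secret"},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        # Requests without an explicit callback go through CrawlSpider.parse,
        # which applies the rules above.
        yield scrapy.Request("http://www.example.com/items/")

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}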

How to integrate Flask & Scrapy?

Adding an HTTP server in front of your spiders is not that easy. There are a couple of options. 1. Python subprocess: If you are really limited to Flask and can't use anything else, the only way to integrate Scrapy with Flask is to launch an external process for every spider crawl, as the other answer recommends (note that … Read more
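A rough sketch of that subprocess approach, assuming a project spider named quotes and a JSON feed file as the hand-off between the two processes; the paths, names, and the -O flag (feed overwrite, available in recent Scrapy versions) are illustrative:

import json
import subprocess
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/crawl")
def crawl():
    # Launch a full scrapy process per request; slow, but it keeps
    # Twisted's reactor out of the Flask process entirely.
    subprocess.run(
        ["scrapy", "crawl", "quotes", "-O", "items.json"],
        check=True,
        cwd="/path/to/scrapy/project",   # placeholder project directory
    )
    with open("/path/to/scrapy/project/items.json") as f:
        return jsonify(json.load(f))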

Scrapy – how to manage cookies/sessions

Three years later, I think this is exactly what you were looking for: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar Just use something like this in your spider's start_requests method: for i, url in enumerate(urls): yield scrapy.Request("http://www.example.com", meta={'cookiejar': i}, callback=self.parse_page) And remember that for subsequent requests, you need to explicitly reattach the cookiejar each time: def parse_page(self, response): # do some … Read more
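Put together as a whole spider, the same idea looks roughly like this (URLs and the spider name are placeholders):

import scrapy

class SessionSpider(scrapy.Spider):
    name = "sessions"   # hypothetical spider name
    urls = ["http://www.example.com/a", "http://www.example.com/b"]

    def start_requests(self):
        # One cookiejar per starting URL gives each chain its own session.
        for i, url in enumerate(self.urls):
            yield scrapy.Request(url, meta={"cookiejar": i},
                                 callback=self.parse_page)

    def parse_page(self, response):
        # Pass the same jar along explicitly, otherwise the follow-up
        # request falls back to the default cookiejar.
        yield scrapy.Request(
            response.urljoin("/next"),
            meta={"cookiejar": response.meta["cookiejar"]},
            callback=self.parse_page,
        )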

Scrapy and proxies

From the Scrapy FAQ, Does Scrapy work with HTTP proxies? Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware. The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell. C:\>set http_proxy=http://proxy:port csh% setenv http_proxy … Read more
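HttpProxyMiddleware also honours a proxy set on an individual request via the proxy meta key; a small sketch, with the spider name and proxy address as placeholders:

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = "proxied"   # hypothetical spider name

    def start_requests(self):
        # HttpProxyMiddleware reads the 'proxy' meta key and routes this
        # request through that proxy instead of the environment variable.
        yield scrapy.Request(
            "http://www.example.com",
            meta={"proxy": "http://proxy:port"},   # placeholder proxy
        )

    def parse(self, response):
        self.logger.info("Fetched %s via proxy", response.url)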

Scrapy Very Basic Example

TL;DR: see Self-contained minimum example script to run scrapy. First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and a separation of concerns that keep things organized, clear and testable. If you are … Read more
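For illustration, a bare-bones spider as it would typically sit inside such a project's spiders package (the module path and CSS selectors are assumptions based on quotes.toscrape.com):

# myproject/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # Item definitions and pipelines would live in items.py and
        # pipelines.py; here we simply yield plain dicts.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }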

How to pass a user defined argument in scrapy spider

Spider arguments are passed in the crawl command using the -a option. For example: scrapy crawl myspider -a category=electronics -a domain=system Spiders can access arguments as attributes: class MySpider(scrapy.Spider): name = "myspider" def __init__(self, category='', **kwargs): self.start_urls = [f'http://www.example.com/{category}'] # py36 super().__init__(**kwargs) # python3 def parse(self, response): self.log(self.domain) # system Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments Update … Read more
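Laid out as a runnable file, that snippet looks roughly like this; note that super().__init__(**kwargs) is what turns -a domain=system into the self.domain attribute:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, category="", **kwargs):
        self.start_urls = [f"http://www.example.com/{category}"]
        super().__init__(**kwargs)   # -a domain=system ends up as self.domain

    def parse(self, response):
        self.log(self.domain)        # "system"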

Scrapy – Reactor not Restartable [duplicate]

You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process: import scrapy import scrapy.crawler as crawler from scrapy.utils.log import configure_logging from multiprocessing import Process, Queue from twisted.internet import reactor # your spider class QuotesSpider(scrapy.Spider): name = "quotes" start_urls = ['http://quotes.toscrape.com/tag/humor/'] def parse(self, response): for quote … Read more
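A hedged sketch of how that forking pattern is usually completed, reusing the QuotesSpider defined above; run_spider is a helper name chosen here for illustration, and each call gets a fresh process and therefore a fresh reactor:

from multiprocessing import Process, Queue
from twisted.internet import reactor
import scrapy.crawler as crawler

def run_spider(spider_cls):
    def f(queue):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider_cls)
            deferred.addBoth(lambda _: reactor.stop())  # stop once the crawl ends
            reactor.run()
            queue.put(None)
        except Exception as exc:
            queue.put(exc)

    queue = Queue()
    process = Process(target=f, args=(queue,))
    process.start()
    result = queue.get()
    process.join()
    if result is not None:
        raise result

# The reactor never restarts inside one process; we just use a new process.
run_spider(QuotesSpider)
run_spider(QuotesSpider)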

How to run Scrapy from within a Python script

All other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 demands: import scrapy from scrapy.crawler import CrawlerProcess class MySpider(scrapy.Spider): # Your spider definition … process = CrawlerProcess({ 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)' }) process.crawl(MySpider) process.start() # the script will block here until the crawling is finished
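When the script lives inside an existing Scrapy project, the project's settings.py can be loaded instead of passing a settings dict; a small variation on the snippet above (the spider name is a placeholder):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Picks up settings.py from the project this script is run inside.
process = CrawlerProcess(get_project_settings())
process.crawl("myspider")   # spiders can also be referenced by name
process.start()             # blocks until the crawl is finished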