Crawling with an authenticated session in Scrapy

Do not override the parse function in a CrawlSpider:

When you are using a CrawlSpider, you shouldn’t override the parse function. There’s a warning in the CrawlSpider documentation here: http://doc.scrapy.org/en/0.14/topics/spiders.html#scrapy.contrib.spiders.Rule

This is because with a CrawlSpider, parse (the default callback of any request) sends the response to be processed by the Rules.
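To make the dispatch concrete, here is a pure-Python sketch — these are illustrative stand-ins, not the real Scrapy classes — of why overriding `parse` breaks a CrawlSpider: `parse` is the dispatcher that feeds every response through the rules, so shadowing it silently disables them.

```python
# Illustrative stand-ins, NOT the actual Scrapy implementation.

class Rule(object):
    def __init__(self, callback):
        self.callback = callback  # name of a spider method, as in Scrapy


class CrawlSpiderSketch(object):
    rules = ()

    def parse(self, response):
        # The default callback doesn't scrape anything itself; it hands
        # the response to the callback of each rule.
        for rule in self.rules:
            for item in getattr(self, rule.callback)(response):
                yield item


class GoodSpider(CrawlSpiderSketch):
    rules = (Rule('parse_item'),)

    def parse_item(self, response):
        yield {'scraped_by': 'parse_item'}


class BadSpider(CrawlSpiderSketch):
    rules = (Rule('parse_item'),)

    def parse(self, response):       # shadows the dispatcher above
        yield {'scraped_by': 'parse'}

    def parse_item(self, response):  # never reached: the rules don't fire
        yield {'scraped_by': 'parse_item'}
```

With these stand-ins, `list(GoodSpider().parse('a response'))` produces the item from `parse_item`, while `BadSpider`'s rule callback is never invoked.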


Logging in before crawling:

In order to have some kind of initialisation before a spider starts crawling, you can use an InitSpider (which inherits from a CrawlSpider), and override the init_request function. This function will be called when the spider is initialising, and before it starts crawling.

In order for the Spider to begin crawling, you need to call (and return the result of) self.initialized().

You can read the code that’s responsible for this here (it has helpful docstrings).


An example:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com/login'
    start_urls = ['http://www.example.com/useful_page/',
                  'http://www.example.com/another_useful_page/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+\.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'name': 'herman', 'password': 'password'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Hi Herman" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from the page here and return the items.
        pass

Saving items:

Items your Spider returns are passed along to the item pipeline, which is responsible for doing whatever you want done with the data. I recommend you read the documentation: http://doc.scrapy.org/en/0.14/topics/item-pipeline.html
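For instance, a pipeline is just a class with a process_item method. Here's a sketch of one that drops items missing a field — the class name, field name, and project path are made up for illustration, and scrapy.exceptions.DropItem (the real exception a pipeline raises to discard an item) is stubbed so the snippet stands alone:

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem."""
    pass


class RequireTitlePipeline(object):
    """Drop items that are missing a 'title' field; pass the rest through."""

    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("Missing title in %s" % item)
        return item  # return the item so later pipelines receive it
```

Pipelines are enabled in settings.py, e.g. ITEM_PIPELINES = ['myproject.pipelines.RequireTitlePipeline'] (the module path here is hypothetical).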

If you have any problems/questions in regards to Items, don’t hesitate to pop open a new question and I’ll do my best to help.
