how to handle 302 redirect in scrapy

Forget about middlewares in this scenario; this will do the trick:

    meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}

That said, you will need to include the meta parameter when you yield your request:

    yield Request(item['link'],
                  meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                  callback=self.your_callback)
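With dont_redirect set, the 302 response itself is delivered to your callback instead of being followed; a minimal sketch of such a callback (the method name matches the placeholder above, the rest is an assumption):

    # Hypothetical callback: inspect the 302 instead of following it
    def your_callback(self, response):
        if response.status == 302:
            # The redirect target is in the Location header (returned as bytes)
            location = response.headers.get('Location')
            self.logger.info('Redirected to: %s', location)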

How can I get all the plain text from a website with Scrapy?

The easiest option would be to extract //body//text() and join everything found:

    ''.join(sel.select("//body//text()").extract()).strip()

where sel is a Selector instance. Another option is to use nltk's clean_html():

    >>> import nltk
    >>> html = """
    ... <div class="post-text" itemprop="description">
    ...
    ... <p>I would like to have all the text visible from a website, after the HTML is

… Read more
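For reference, the same one-liner inside a spider callback looks like this; note that newer Scrapy versions spell the method .xpath() rather than .select() (a sketch, not from the original answer):

    # Sketch: join all visible text nodes under <body> in a spider callback
    def parse(self, response):
        text = ''.join(response.xpath('//body//text()').extract()).strip()
        yield {'text': text}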

how to filter duplicate requests based on url in scrapy

You can write custom middleware for duplicate removal and add it in settings:

    import os
    from scrapy.dupefilter import RFPDupeFilter

    class CustomFilter(RFPDupeFilter):
        """A dupe filter that considers specific ids in the url"""

        def __getid(self, url):
            mm = url.split("&refer")[0]  # or something like that
            return mm

        def request_seen(self, request):
            fp = self.__getid(request.url)
            if fp in self.fingerprints:
                return True

… Read more
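To activate a filter like this you point Scrapy at it via the DUPEFILTER_CLASS setting; the module path below is a placeholder for wherever you put the class:

    # settings.py -- 'myproject.customfilters' is a hypothetical module path
    DUPEFILTER_CLASS = 'myproject.customfilters.CustomFilter'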

Access django models inside of Scrapy

If anyone else is having the same problem, this is how I solved it. I added this to my scrapy settings.py file:

    def setup_django_env(path):
        import imp, os
        from django.core.management import setup_environ

        f, filename, desc = imp.find_module('settings', [path])
        project = imp.load_module('settings', f, filename, desc)
        setup_environ(project)

    setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, … Read more
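Note that setup_environ was removed in Django 1.6 (and imp is deprecated in current Python); on modern versions the same effect is achieved with django.setup(), roughly like this (the settings module name is an assumption):

    # Modern sketch: 'myproject.settings' is a placeholder for your settings module
    import os
    import django

    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
    django.setup()  # after this call, Django models can be imported normally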

Scrapy image download how to use custom filename

This is just an actualization of the answer for Scrapy 0.24 (EDITED), where image_key() is deprecated:

    class MyImagesPipeline(ImagesPipeline):
        # Name download version
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):

… Read more
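A pipeline like this only runs once it is enabled in settings; a minimal sketch, where the module path and storage directory are assumptions:

    # settings.py -- module path and IMAGES_STORE value are hypothetical
    ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
    IMAGES_STORE = '/path/to/image/storage'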

Click a Button in Scrapy

Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, you want to be using Selenium. If using Scrapy, the solution to the problem depends on what the button is doing. If it's just showing content that was previously hidden, you can scrape the data without a problem; it doesn't … Read more
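If the button just fires an XHR you can see in the browser's network tab, a common Scrapy-only approach is to replicate that request directly; a sketch where the endpoint and form data are pure assumptions:

    # Hypothetical sketch: reproduce the AJAX request a button would trigger
    import scrapy

    class AjaxButtonSpider(scrapy.Spider):
        name = 'ajax_button'
        start_urls = ['https://example.com/listing']  # placeholder URL

        def parse(self, response):
            # Endpoint and payload are assumptions; copy the real ones from
            # the network tab of your browser's developer tools
            yield scrapy.FormRequest(
                'https://example.com/api/load_more',
                formdata={'page': '2'},
                callback=self.parse_more,
            )

        def parse_more(self, response):
            self.logger.info('Fetched %d bytes via AJAX', len(response.body))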

passing selenium response url to scrapy

Use Downloader Middleware to catch selenium-required pages before you process them regularly with Scrapy: "The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses." Here's a very basic example using PhantomJS:

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JSMiddleware(object):

… Read more
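The snippet above is cut off; a minimal completion of the usual pattern looks roughly like this (PhantomJS is deprecated in current Selenium, so treat it as a sketch):

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JSMiddleware(object):
        def process_request(self, request, spider):
            driver = webdriver.PhantomJS()  # deprecated; use headless Chrome/Firefox today
            driver.get(request.url)
            body = driver.page_source
            driver.quit()
            # Returning a response here short-circuits the normal download
            return HtmlResponse(request.url, body=body, encoding='utf-8', request=request)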

Run a Scrapy spider in a Celery Task

The twisted reactor cannot be restarted. A workaround for this is to let the celery task fork a new child process for each crawl you want to execute, as proposed in the following post: Running Scrapy spiders in a Celery task. This gets around the "reactor cannot be restarted" issue by utilizing the multiprocessing … Read more
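A minimal sketch of that multiprocessing workaround, assuming a standard Scrapy project and an existing Celery app (the spider name and task wiring are placeholders):

    # Sketch: run each crawl in a fresh child process so the Twisted reactor
    # starts and stops cleanly inside a long-lived Celery worker
    from multiprocessing import Process

    from celery import shared_task
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def _run_spider(spider_name):
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name)
        process.start()  # blocks until the crawl finishes; the child then exits

    @shared_task
    def crawl(spider_name):
        p = Process(target=_run_spider, args=(spider_name,))
        p.start()
        p.join()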