how to handle 302 redirect in scrapy

Forget about middlewares in this scenario; this will do the trick:

    meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}

That said, you will need to include the meta parameter when you yield your request:

    yield Request(item['link'],
                  meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                  callback=self.your_callback)
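With dont_redirect set, the 302 response itself is delivered to your callback instead of being followed; a minimal sketch of such a callback (the method name matches the placeholder above, the rest is an assumption):

    # Hypothetical callback: inspect the 302 instead of following it
    def your_callback(self, response):
        if response.status == 302:
            # The redirect target is in the Location header (returned as bytes)
            location = response.headers.get('Location')
            self.logger.info('Redirected to: %s', location)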

How can I get all the plain text from a website with Scrapy?

The easiest option would be to extract //body//text() and join everything found:

    ''.join(sel.select("//body//text()").extract()).strip()

where sel is a Selector instance. Another option is to use nltk's clean_html():

    >>> import nltk
    >>> html = """
    ... <div class="post-text" itemprop="description">
    ...
    ... <p>I would like to have all the text visible from a website, after the HTML is

… Read more
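For reference, the same one-liner inside a spider callback looks like this; note that newer Scrapy versions spell the method .xpath() rather than .select() (a sketch, not from the original answer):

    # Sketch: join all visible text nodes under <body> in a spider callback
    def parse(self, response):
        text = ''.join(response.xpath('//body//text()').extract()).strip()
        yield {'text': text}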

how to filter duplicate requests based on url in scrapy

You can write custom middleware for duplicate removal and add it in settings:

    import os
    from scrapy.dupefilter import RFPDupeFilter

    class CustomFilter(RFPDupeFilter):
        """A dupe filter that considers specific ids in the url"""

        def __getid(self, url):
            mm = url.split("&refer")[0]  # or something like that
            return mm

        def request_seen(self, request):
            fp = self.__getid(request.url)
            if fp in self.fingerprints:
                return True

… Read more
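To activate a filter like this you point Scrapy at it via the DUPEFILTER_CLASS setting; the module path below is a placeholder for wherever you put the class:

    # settings.py -- 'myproject.customfilters' is a hypothetical module path
    DUPEFILTER_CLASS = 'myproject.customfilters.CustomFilter'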

Access django models inside of Scrapy

If anyone else is having the same problem, this is how I solved it. I added this to my scrapy settings.py file:

    def setup_django_env(path):
        import imp, os
        from django.core.management import setup_environ

        f, filename, desc = imp.find_module('settings', [path])
        project = imp.load_module('settings', f, filename, desc)
        setup_environ(project)

    setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, … Read more
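Note that setup_environ was removed in Django 1.6 (and imp is deprecated in current Python); on modern versions the same effect is achieved with django.setup(), roughly like this (the settings module name is an assumption):

    # Modern sketch: 'myproject.settings' is a placeholder for your settings module
    import os
    import django

    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
    django.setup()  # after this call, Django models can be imported normally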

Scrapy image download how to use custom filename

This is just an actualization of the answer for Scrapy 0.24 (EDITED), where image_key() is deprecated:

    class MyImagesPipeline(ImagesPipeline):
        # Name download version
        def file_path(self, request, response=None, info=None):
            # item = request.meta['item']  # Like this you can use all from item, not just url.
            image_guid = request.url.split('/')[-1]
            return 'full/%s' % (image_guid)

        # Name thumbnail version
        def thumb_path(self, request, thumb_id, response=None, info=None):

… Read more
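A pipeline like this only runs once it is enabled in settings; a minimal sketch, where the module path and storage directory are assumptions:

    # settings.py -- module path and IMAGES_STORE value are hypothetical
    ITEM_PIPELINES = {'myproject.pipelines.MyImagesPipeline': 1}
    IMAGES_STORE = '/path/to/image/storage'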

Click a Button in Scrapy

Scrapy cannot interpret JavaScript. If you absolutely must interact with the JavaScript on the page, you want to be using Selenium. If using Scrapy, the solution to the problem depends on what the button is doing. If it's just showing content that was previously hidden, you can scrape the data without a problem; it doesn't … Read more
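If the button just fires an XHR you can see in the browser's network tab, a common Scrapy-only approach is to replicate that request directly; a sketch where the endpoint and form data are pure assumptions:

    # Hypothetical sketch: reproduce the AJAX request a button would trigger
    import scrapy

    class AjaxButtonSpider(scrapy.Spider):
        name = 'ajax_button'
        start_urls = ['https://example.com/listing']  # placeholder URL

        def parse(self, response):
            # Endpoint and payload are assumptions; copy the real ones from
            # the network tab of your browser's developer tools
            yield scrapy.FormRequest(
                'https://example.com/api/load_more',
                formdata={'page': '2'},
                callback=self.parse_more,
            )

        def parse_more(self, response):
            self.logger.info('Fetched %d bytes via AJAX', len(response.body))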

passing selenium response url to scrapy

Use Downloader Middleware to catch selenium-required pages before you process them regularly with Scrapy: "The downloader middleware is a framework of hooks into Scrapy's request/response processing. It's a light, low-level system for globally altering Scrapy's requests and responses." Here's a very basic example using PhantomJS:

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JSMiddleware(object):

… Read more
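The snippet above is cut off; a minimal completion of the usual pattern looks roughly like this (PhantomJS is deprecated in current Selenium, so treat it as a sketch):

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class JSMiddleware(object):
        def process_request(self, request, spider):
            driver = webdriver.PhantomJS()  # deprecated; use headless Chrome/Firefox today
            driver.get(request.url)
            body = driver.page_source
            driver.quit()
            # Returning a response here short-circuits the normal download
            return HtmlResponse(request.url, body=body, encoding='utf-8', request=request)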

Run a Scrapy spider in a Celery Task

The twisted reactor cannot be restarted. A workaround for this is to let the celery task fork a new child process for each crawl you want to execute, as proposed in the following post: Running Scrapy spiders in a Celery task. This gets around the "reactor cannot be restarted" issue by utilizing the multiprocessing … Read more
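A minimal sketch of that multiprocessing workaround, assuming a standard Scrapy project and an existing Celery app (the spider name and task wiring are placeholders):

    # Sketch: run each crawl in a fresh child process so the Twisted reactor
    # starts and stops cleanly inside a long-lived Celery worker
    from multiprocessing import Process

    from celery import shared_task
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    def _run_spider(spider_name):
        process = CrawlerProcess(get_project_settings())
        process.crawl(spider_name)
        process.start()  # blocks until the crawl finishes; the child then exits

    @shared_task
    def crawl(spider_name):
        p = Process(target=_run_spider, args=(spider_name,))
        p.start()
        p.join()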