TypeError: Object of type 'bytes' is not JSON serializable

You are creating those bytes objects yourself: item['title'] = [t.encode('utf-8') for t in title] item['link'] = [l.encode('utf-8') for l in link] item['desc'] = [d.encode('utf-8') for d in desc] items.append(item) Each of those t.encode(), l.encode() and d.encode() calls creates a bytes object. Do not do this; leave it to the JSON encoder to serialise these. Next, … Read more
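For reference, a minimal sketch of the fix under the assumption that title, link and desc are lists of ordinary str values: keep them as str and let json.dumps handle the encoding (the sample values below are placeholders, not taken from the original question).

    import json

    title = ['Example title']
    link = ['https://example.com']
    desc = ['Example description']

    item = {}
    # Keep the values as str; json.dumps serialises Unicode text itself.
    item['title'] = title
    item['link'] = link
    item['desc'] = desc

    print(json.dumps(item))  # works

    # Encoding to bytes first is what raises the error:
    # json.dumps({'title': [t.encode('utf-8') for t in title]})
    # TypeError: Object of type bytes is not JSON serializable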

Crawl a site that has infinite scrolling using Python

You can use Selenium to scrape an infinite-scrolling website like Twitter or Facebook. Step 1: Install Selenium using pip: pip install selenium Step 2: use the code below to automate the infinite scroll and extract the source code: from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import … Read more
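The excerpt is cut off before the scrolling loop itself; below is a minimal sketch of the usual "scroll until the page height stops growing" pattern with Selenium (the URL, pause time and Chrome driver are placeholders, not taken from the original answer).

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()             # assumes chromedriver is available on PATH
    driver.get('https://example.com/feed')  # placeholder URL for an infinite-scroll page

    SCROLL_PAUSE = 2  # seconds to wait for new content to load after each scroll

    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        # Scroll to the bottom of the page.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(SCROLL_PAUSE)

        # Stop once the page height no longer grows, i.e. no more content is loaded.
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height

    html = driver.page_source  # full source after all content has been loaded
    driver.quit()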

Scrapy crawl from script always blocks script execution after scraping

You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal: from twisted.internet import reactor from scrapy import log, signals from scrapy.crawler import Crawler from scrapy.settings import Settings from scrapy.xlib.pydispatch import dispatcher from testspiders.spiders.followall import FollowAllSpider def stop_reactor(): reactor.stop() dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = FollowAllSpider(domain='scrapinghub.com') crawler … Read more
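Note that the imports in the excerpt target an old Scrapy API (scrapy.xlib.pydispatch has since been removed). A roughly equivalent sketch with the current CrawlerRunner API, which stops the reactor once the crawl finishes, looks like this (the testspiders project and domain argument are carried over from the answer):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from testspiders.spiders.followall import FollowAllSpider

    runner = CrawlerRunner()
    d = runner.crawl(FollowAllSpider, domain='scrapinghub.com')
    # Stop the reactor when the crawl (and its spider_closed handling) is done,
    # so the script no longer blocks after scraping.
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks here until reactor.stop() is called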

scrapy: Call a function when a spider quits

It looks like you can register a signal listener through dispatcher. I would try something like: from scrapy import signals from scrapy.xlib.pydispatch import dispatcher class MySpider(CrawlSpider): def __init__(self): dispatcher.connect(self.spider_closed, signals.spider_closed) def spider_closed(self, spider): # second param is instance of spider about to be closed. In newer versions of Scrapy, scrapy.xlib.pydispatch is deprecated; instead you … Read more
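The excerpt breaks off where it starts describing the replacement; a minimal sketch of the currently recommended pattern, which connects the signal through the crawler's own signal manager in from_crawler (the spider name and log message are illustrative):

    from scrapy import signals
    from scrapy.spiders import CrawlSpider


    class MySpider(CrawlSpider):
        name = 'my_spider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Register the callback via the crawler's signal manager instead of
            # the removed scrapy.xlib.pydispatch dispatcher.
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            # Called once, just before the spider is closed.
            spider.logger.info('Spider closed: %s', spider.name)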

Running Scrapy spiders in a Celery task

Okay, here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code located here: http://snippets.scrapy.org/snippets/13/ First, the tasks.py file: from celery import task @task() def crawl_domain(domain_pk): from crawl import domain_crawl return domain_crawl(domain_pk) Then the crawl.py file: from multiprocessing … Read more
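The crawl.py file is cut off above; as a rough sketch of the multiprocessing idea it refers to, each Celery task can run the actual crawl in a child process so Twisted's non-restartable reactor starts fresh every time (the spider name, URL lookup and helper names below are illustrative, not the original snippet):

    # crawl.py (sketch)
    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def _run_spider(start_url):
        # Runs in a child process, so the Twisted reactor is created fresh here
        # and the "ReactorNotRestartable" problem never reaches the Celery worker.
        process = CrawlerProcess(get_project_settings())
        process.crawl('domain_spider', start_url=start_url)  # a spider registered in the project
        process.start()  # blocks until the crawl finishes


    def domain_crawl(domain_pk):
        # Resolve the primary key to a URL however your project does it.
        start_url = f'https://example.com/{domain_pk}'  # placeholder lookup
        p = Process(target=_run_spider, args=(start_url,))
        p.start()
        p.join()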

How to give a URL to Scrapy for crawling?

I'm not really sure about the command-line option. However, you could write your spider like this: class MySpider(BaseSpider): name = "my_spider" def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) self.start_urls = [kwargs.get('start_url')] And start it like: scrapy crawl my_spider -a start_url="http://some_url"
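For newer Scrapy versions, where BaseSpider has been replaced by scrapy.Spider, the same idea looks roughly like this (spider name kept from the answer):

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'my_spider'

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # The value passed with -a start_url=... arrives as a keyword argument.
            self.start_urls = [start_url] if start_url else []

It is started the same way: scrapy crawl my_spider -a start_url="http://some_url"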

How can I use multiple requests and pass items in between them in scrapy python

No problem. The following is a corrected version of your code: def page_parser(self, response): sites = hxs.select('//div[@class="row"]') items = [] request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1) request.meta['item'] = item yield request request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item}) yield request yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item}) def parseDescription1(self, response): item = response.meta['item'] item['desc1'] = "test" return item def parseDescription2(self, response): item = response.meta['item'] … Read more
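The excerpt is truncated; below is a self-contained sketch of the underlying pattern, passing the same item from one request's callback to the next through meta, chained sequentially rather than fired in parallel (spider name and example.com URLs are placeholders):

    import scrapy


    class ChainSpider(scrapy.Spider):
        name = 'chain'
        start_urls = ['https://www.example.com/']

        def parse(self, response):
            item = {}
            # Attach the partially filled item to the next request.
            yield scrapy.Request('https://www.example.com/page1',
                                 callback=self.parse_desc1,
                                 meta={'item': item})

        def parse_desc1(self, response):
            item = response.meta['item']
            item['desc1'] = 'test'
            # Hand the same item on to the next callback instead of returning it here.
            yield scrapy.Request('https://www.example.com/page2',
                                 callback=self.parse_desc2,
                                 meta={'item': item})

        def parse_desc2(self, response):
            item = response.meta['item']
            item['desc2'] = 'test2'
            # Only the last callback in the chain yields the finished item.
            yield item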