TypeError: Object of type 'bytes' is not JSON serializable

You are creating those bytes objects yourself: item['title'] = [t.encode('utf-8') for t in title] item['link'] = [l.encode('utf-8') for l in link] item['desc'] = [d.encode('utf-8') for d in desc] items.append(item) Each of those t.encode(), l.encode() and d.encode() calls creates a bytes object. Do not do this; leave it to the JSON encoder to serialise these. Next, … Read more
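For reference, a minimal sketch of the fix under the assumption that title, link and desc are lists of ordinary str values: keep them as str and let json.dumps handle the encoding (the sample values below are placeholders, not taken from the original question).

    import json

    title = ['Example title']
    link = ['https://example.com']
    desc = ['Example description']

    item = {}
    # Keep the values as str; json.dumps serialises Unicode text itself.
    item['title'] = title
    item['link'] = link
    item['desc'] = desc

    print(json.dumps(item))  # works

    # Encoding to bytes first is what raises the error:
    # json.dumps({'title': [t.encode('utf-8') for t in title]})
    # TypeError: Object of type bytes is not JSON serializable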

Crawl a site that has infinite scrolling using Python

You can use Selenium to scrape an infinite-scrolling website like Twitter or Facebook. Step 1: Install Selenium using pip: pip install selenium Step 2: use the code below to automate the infinite scroll and extract the source code: from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import … Read more
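The excerpt is cut off before the scrolling loop itself; below is a minimal sketch of the usual "scroll until the page height stops growing" pattern with Selenium (the URL, pause time and Chrome driver are placeholders, not taken from the original answer).

    import time
    from selenium import webdriver

    driver = webdriver.Chrome()             # assumes chromedriver is available on PATH
    driver.get('https://example.com/feed')  # placeholder URL for an infinite-scroll page

    SCROLL_PAUSE = 2  # seconds to wait for new content to load after each scroll

    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        # Scroll to the bottom of the page.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(SCROLL_PAUSE)

        # Stop once the page height no longer grows, i.e. no more content is loaded.
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height

    html = driver.page_source  # full source after all content has been loaded
    driver.quit()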

Scrapy crawl from script always blocks script execution after scraping

You will need to stop the reactor when the spider finishes. You can accomplish this by listening for the spider_closed signal: from twisted.internet import reactor from scrapy import log, signals from scrapy.crawler import Crawler from scrapy.settings import Settings from scrapy.xlib.pydispatch import dispatcher from testspiders.spiders.followall import FollowAllSpider def stop_reactor(): reactor.stop() dispatcher.connect(stop_reactor, signal=signals.spider_closed) spider = FollowAllSpider(domain='scrapinghub.com') crawler … Read more
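Note that the imports in the excerpt target an old Scrapy API (scrapy.xlib.pydispatch has since been removed). A roughly equivalent sketch with the current CrawlerRunner API, which stops the reactor once the crawl finishes, looks like this (the testspiders project and domain argument are carried over from the answer):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from testspiders.spiders.followall import FollowAllSpider

    runner = CrawlerRunner()
    d = runner.crawl(FollowAllSpider, domain='scrapinghub.com')
    # Stop the reactor when the crawl (and its spider_closed handling) is done,
    # so the script no longer blocks after scraping.
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # blocks here until reactor.stop() is called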

scrapy: Call a function when a spider quits

It looks like you can register a signal listener through dispatcher. I would try something like: from scrapy import signals from scrapy.xlib.pydispatch import dispatcher class MySpider(CrawlSpider): def __init__(self): dispatcher.connect(self.spider_closed, signals.spider_closed) def spider_closed(self, spider): # second param is instance of spider about to be closed. In newer versions of Scrapy, scrapy.xlib.pydispatch is deprecated; instead you … Read more
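The excerpt breaks off where it starts describing the replacement; a minimal sketch of the currently recommended pattern, which connects the signal through the crawler's own signal manager in from_crawler (the spider name and log message are illustrative):

    from scrapy import signals
    from scrapy.spiders import CrawlSpider


    class MySpider(CrawlSpider):
        name = 'my_spider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Register the callback via the crawler's signal manager instead of
            # the removed scrapy.xlib.pydispatch dispatcher.
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            # Called once, just before the spider is closed.
            spider.logger.info('Spider closed: %s', spider.name)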

Running Scrapy spiders in a Celery task

Okay, here is how I got Scrapy working with my Django project that uses Celery for queuing up what to crawl. The actual workaround came primarily from joehillen's code located here: http://snippets.scrapy.org/snippets/13/ First, the tasks.py file: from celery import task @task() def crawl_domain(domain_pk): from crawl import domain_crawl return domain_crawl(domain_pk) Then the crawl.py file: from multiprocessing … Read more
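The crawl.py file is cut off above; as a rough sketch of the multiprocessing idea it refers to, each Celery task can run the actual crawl in a child process so Twisted's non-restartable reactor starts fresh every time (the spider name, URL lookup and helper names below are illustrative, not the original snippet):

    # crawl.py (sketch)
    from multiprocessing import Process

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    def _run_spider(start_url):
        # Runs in a child process, so the Twisted reactor is created fresh here
        # and the "ReactorNotRestartable" problem never reaches the Celery worker.
        process = CrawlerProcess(get_project_settings())
        process.crawl('domain_spider', start_url=start_url)  # a spider registered in the project
        process.start()  # blocks until the crawl finishes


    def domain_crawl(domain_pk):
        # Resolve the primary key to a URL however your project does it.
        start_url = f'https://example.com/{domain_pk}'  # placeholder lookup
        p = Process(target=_run_spider, args=(start_url,))
        p.start()
        p.join()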

How to give a URL to Scrapy for crawling?

I'm not really sure about the command-line option. However, you could write your spider like this: class MySpider(BaseSpider): name = "my_spider" def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) self.start_urls = [kwargs.get('start_url')] And start it like: scrapy crawl my_spider -a start_url="http://some_url"
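For newer Scrapy versions, where BaseSpider has been replaced by scrapy.Spider, the same idea looks roughly like this (spider name kept from the answer):

    import scrapy


    class MySpider(scrapy.Spider):
        name = 'my_spider'

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # The value passed with -a start_url=... arrives as a keyword argument.
            self.start_urls = [start_url] if start_url else []

It is started the same way: scrapy crawl my_spider -a start_url="http://some_url"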

How can I use multiple requests and pass items in between them in scrapy python

No problem. The following is a corrected version of your code: def page_parser(self, response): sites = hxs.select('//div[@class="row"]') items = [] request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1) request.meta['item'] = item yield request request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription2, meta={'item': item}) yield request yield Request("http://www.example.com/lin1.cpp", callback=self.parseDescription3, meta={'item': item}) def parseDescription1(self, response): item = response.meta['item'] item['desc1'] = "test" return item def parseDescription2(self, response): item = response.meta['item'] … Read more
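The excerpt is truncated; below is a self-contained sketch of the underlying pattern, passing the same item from one request's callback to the next through meta, chained sequentially rather than fired in parallel (spider name and example.com URLs are placeholders):

    import scrapy


    class ChainSpider(scrapy.Spider):
        name = 'chain'
        start_urls = ['https://www.example.com/']

        def parse(self, response):
            item = {}
            # Attach the partially filled item to the next request.
            yield scrapy.Request('https://www.example.com/page1',
                                 callback=self.parse_desc1,
                                 meta={'item': item})

        def parse_desc1(self, response):
            item = response.meta['item']
            item['desc1'] = 'test'
            # Hand the same item on to the next callback instead of returning it here.
            yield scrapy.Request('https://www.example.com/page2',
                                 callback=self.parse_desc2,
                                 meta={'item': item})

        def parse_desc2(self, response):
            item = response.meta['item']
            item['desc2'] = 'test2'
            # Only the last callback in the chain yields the finished item.
            yield item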