Scrapy Crawl URLs in Order

Scrapy's Request now has a priority attribute. If you yield many Requests from a callback and want a particular one processed first, you can set it like this:

```python
def parse(self, response):
    url = "http://www.example.com/first"
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = "http://www.example.com/second"
    yield Request(url=url, callback=self.parse_data)
```

Scrapy will process the request with priority=1 first (the default priority is 0).
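If the whole crawl should follow the order of a list of start URLs rather than just favouring one request, the same priority attribute can be applied across the list. A minimal sketch, assuming placeholder URLs and a hypothetical parse_data callback; note that priority only controls scheduling order, so with concurrent downloads the responses can still arrive out of order unless CONCURRENT_REQUESTS is limited to 1:

```python
import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered_spider"
    # Placeholder URLs: replace with the pages that must be crawled in order.
    urls = [
        "http://www.example.com/first",
        "http://www.example.com/second",
        "http://www.example.com/third",
    ]
    # Optional: fetch one request at a time so responses also come back in order.
    custom_settings = {"CONCURRENT_REQUESTS": 1}

    def start_requests(self):
        # Higher priority values are dequeued first, so count downwards:
        # the first URL gets the largest priority.
        for priority, url in enumerate(reversed(self.urls)):
            yield scrapy.Request(url, callback=self.parse_data, priority=priority)

    def parse_data(self, response):
        # Hypothetical callback: just record which page was fetched.
        yield {"url": response.url}
```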

Headless Browser and scraping – solutions [closed]

If Ruby is your thing, you may also try:

- https://github.com/chriskite/anemone (development has stopped)
- https://github.com/sparklemotion/mechanize
- https://github.com/postmodern/spidr
- https://github.com/stewartmckee/cobweb
- http://watirwebdriver.com/ (Selenium-based)

The Nokogiri gem can also be used for scraping: http://nokogiri.org/. There is a dedicated book from Packt Publishing on how to use Nokogiri for scraping.

Scraping dynamic content using python-Scrapy

You can also solve it with ScrapyJS (no need for Selenium and a real browser): this library provides Scrapy+JavaScript integration using Splash. Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

```
$ docker run -p 8050:8050 scrapinghub/splash
```

Put the following settings into settings.py:

```python
SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
```

… Read more
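With the middleware enabled, individual requests opt in to JavaScript rendering through the request's splash meta key. A minimal sketch, assuming a placeholder URL and selector and the render.html endpoint; treat the exact meta keys as coming from the ScrapyJS/Splash documentation rather than from this excerpt:

```python
import scrapy

class JsPageSpider(scrapy.Spider):
    name = "js_page_spider"
    start_urls = ["http://www.example.com/dynamic"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={
                    "splash": {
                        "endpoint": "render.html",  # return the rendered HTML
                        "args": {"wait": 0.5},      # give the page's JS time to run
                    }
                },
            )

    def parse(self, response):
        # The response body is now the JavaScript-rendered page.
        for title in response.css("h2.title::text").extract():
            yield {"title": title}
```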

selenium with scrapy for dynamic page

It really depends on how you need to scrape the site and what data you want to get. Here's an example of how you can follow pagination on eBay using Scrapy+Selenium:

```python
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver =
```

… Read more
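The excerpt is cut off at the driver setup and the rest of that answer is not reproduced here. As a rough, hypothetical illustration of how such a Scrapy+Selenium spider is usually completed (the Firefox driver, CSS selectors, and item fields below are assumptions, not the original code):

```python
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

class SeleniumSketchSpider(scrapy.Spider):
    name = "selenium_sketch"
    start_urls = ["http://www.example.com/paginated-listing"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()  # any WebDriver would do

    def parse(self, response):
        # Load the page in a real browser so its JavaScript runs.
        self.driver.get(response.url)

        while True:
            # Scrape whatever the rendered DOM exposes (selector is a placeholder).
            for element in self.driver.find_elements(By.CSS_SELECTOR, "div.item h3"):
                yield {"title": element.text}

            # Follow pagination until there is no "next" link left.
            try:
                next_link = self.driver.find_element(By.CSS_SELECTOR, "a.next")
            except NoSuchElementException:
                break
            next_link.click()
            # In practice a WebDriverWait for the new content belongs here.

        self.driver.quit()
```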