Scrapy Crawl URLs in Order

Scrapy's Request now has a priority attribute. If you yield many Requests from a callback and want a particular one processed first, you can set it like this:

```python
def parse(self, response):
    url = "http://www.example.com/first"
    yield Request(url=url, callback=self.parse_data, priority=1)

    url = "http://www.example.com/second"
    yield Request(url=url, callback=self.parse_data)
```

Scrapy will process the request with priority=1 first (the default priority is 0).
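If the whole crawl should follow the order of a list of start URLs rather than just favouring one request, the same priority attribute can be applied across the list. A minimal sketch, assuming placeholder URLs and a hypothetical parse_data callback; note that priority only controls scheduling order, so with concurrent downloads the responses can still arrive out of order unless CONCURRENT_REQUESTS is limited to 1:

```python
import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered_spider"
    # Placeholder URLs: replace with the pages that must be crawled in order.
    urls = [
        "http://www.example.com/first",
        "http://www.example.com/second",
        "http://www.example.com/third",
    ]
    # Optional: fetch one request at a time so responses also come back in order.
    custom_settings = {"CONCURRENT_REQUESTS": 1}

    def start_requests(self):
        # Higher priority values are dequeued first, so count downwards:
        # the first URL gets the largest priority.
        for priority, url in enumerate(reversed(self.urls)):
            yield scrapy.Request(url, callback=self.parse_data, priority=priority)

    def parse_data(self, response):
        # Hypothetical callback: just record which page was fetched.
        yield {"url": response.url}
```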

Headless Browser and scraping – solutions [closed]

If Ruby is your thing, you may also try:

- https://github.com/chriskite/anemone (development has stopped)
- https://github.com/sparklemotion/mechanize
- https://github.com/postmodern/spidr
- https://github.com/stewartmckee/cobweb
- http://watirwebdriver.com/ (Selenium-based)

The Nokogiri gem can also be used for scraping: http://nokogiri.org/. There is a dedicated book from Packt Publishing on how to use Nokogiri for scraping.

Scraping dynamic content using python-Scrapy

You can also solve it with ScrapyJS (no need for Selenium and a real browser): this library provides Scrapy+JavaScript integration using Splash. Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

```
$ docker run -p 8050:8050 scrapinghub/splash
```

Put the following settings into settings.py:

```python
SPLASH_URL = 'http://192.168.59.103:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
```

… Read more
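With the middleware enabled, individual requests opt in to JavaScript rendering through the request's splash meta key. A minimal sketch, assuming a placeholder URL and selector and the render.html endpoint; treat the exact meta keys as coming from the ScrapyJS/Splash documentation rather than from this excerpt:

```python
import scrapy

class JsPageSpider(scrapy.Spider):
    name = "js_page_spider"
    start_urls = ["http://www.example.com/dynamic"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                callback=self.parse,
                meta={
                    "splash": {
                        "endpoint": "render.html",  # return the rendered HTML
                        "args": {"wait": 0.5},      # give the page's JS time to run
                    }
                },
            )

    def parse(self, response):
        # The response body is now the JavaScript-rendered page.
        for title in response.css("h2.title::text").extract():
            yield {"title": title}
```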

selenium with scrapy for dynamic page

It really depends on how you need to scrape the site and what data you want to get. Here's an example of how you can follow pagination on eBay using Scrapy+Selenium:

```python
import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver =
```

… Read more
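The excerpt is cut off at the driver setup and the rest of that answer is not reproduced here. As a rough, hypothetical illustration of how such a Scrapy+Selenium spider is usually completed (the Firefox driver, CSS selectors, and item fields below are assumptions, not the original code):

```python
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

class SeleniumSketchSpider(scrapy.Spider):
    name = "selenium_sketch"
    start_urls = ["http://www.example.com/paginated-listing"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()  # any WebDriver would do

    def parse(self, response):
        # Load the page in a real browser so its JavaScript runs.
        self.driver.get(response.url)

        while True:
            # Scrape whatever the rendered DOM exposes (selector is a placeholder).
            for element in self.driver.find_elements(By.CSS_SELECTOR, "div.item h3"):
                yield {"title": element.text}

            # Follow pagination until there is no "next" link left.
            try:
                next_link = self.driver.find_element(By.CSS_SELECTOR, "a.next")
            except NoSuchElementException:
                break
            next_link.click()
            # In practice a WebDriverWait for the new content belongs here.

        self.driver.quit()
```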