web-crawler - w3toppers.com

Python: maximum recursion depth exceeded while calling a Python object

Python don’t have a great support for recursion because of it’s lack of TRE (Tail Recursion Elimination). This means that each call to your recursive function will create a function call stack and because there is a limit of stack depth (by default is 1000) that you can check out by sys.getrecursionlimit (of course you … Read more

getting Forbidden by robots.txt: scrapy

In the new version (scrapy 1.1) launched 2016-05-11 the crawl first downloads robots.txt before crawling. To change this behavior change in your settings.py with ROBOTSTXT_OBEY ROBOTSTXT_OBEY = False Here are the release notes

How can I use different pipelines for different spiders in a single Scrapy project

Just remove all pipelines from main settings and use this inside spider. This will define the pipeline to user per spider class testSpider(InitSpider): name=”test” custom_settings = { ‘ITEM_PIPELINES’: { ‘app.MyPipeline’: 400 } }

crawl site that has infinite scrolling using python

You can use selenium to scrap the infinite scrolling website like twitter or facebook. Step 1 : Install Selenium using pip pip install selenium Step 2 : use the code below to automate infinite scroll and extract the source code from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.common.keys import Keys from selenium.webdriver.support.ui import … Read more

How to give URL to scrapy for crawling?

I’m not really sure about the commandline option. However, you could write your spider like this. class MySpider(BaseSpider): name=”my_spider” def __init__(self, *args, **kwargs): super(MySpider, self).__init__(*args, **kwargs) self.start_urls = [kwargs.get(‘start_url’)] And start it like: scrapy crawl my_spider -a start_url=”http://some_url”

how to extract links and titles from a .html page?

Thank you everyone, I GOT IT! The final code: $html = file_get_contents(‘bookmarks.html’); //Create a new DOM document $dom = new DOMDocument; //Parse the HTML. The @ is used to suppress any parsing errors //that will be thrown if the $html string isn’t valid XHTML. @$dom->loadHTML($html); //Get all links. You could also use any other tag … Read more

python: [Errno 10054] An existing connection was forcibly closed by the remote host

This can be caused by the two sides of the connection disagreeing over whether the connection timed out or not during a keepalive. (Your code tries to reused the connection just as the server is closing it because it has been idle for too long.) You should basically just retry the operation over a new … Read more

Nodejs: Async request with a list of URL

You can use something like Promise library e.g. snippet const Promise = require(“bluebird”); const axios = require(“axios”); //Axios wrapper for error handling const axios_wrapper = (options) => { return axios(…options) .then((r) => { return Promise.resolve({ data: r.data, error: null, }); }) .catch((e) => { return Promise.resolve({ data: null, error: e.response ? e.response.data : e, }); … Read more

how to filter duplicate requests based on url in scrapy

You can write custom middleware for duplicate removal and add it in settings import os from scrapy.dupefilter import RFPDupeFilter class CustomFilter(RFPDupeFilter): “””A dupe filter that considers specific ids in the url””” def __getid(self, url): mm = url.split(“&refer”)[0] #or something like that return mm def request_seen(self, request): fp = self.__getid(request.url) if fp in self.fingerprints: return True … Read more

How to find all links / pages on a website

Check out linkchecker—it will crawl the site (while obeying robots.txt) and generate a report. From there, you can script up a solution for creating the directory tree.