How to filter duplicate requests based on URL in Scrapy

You can write a custom duplicate filter (not middleware, but a dupefilter) for duplicate removal and register it in your settings:

import os

from scrapy.dupefilters import RFPDupeFilter


class CustomFilter(RFPDupeFilter):
    """A dupe filter that keys requests on the significant part of the URL."""

    def __getid(self, url):
        # Keep only the part of the URL before "&refer" (or whatever
        # marks the insignificant tail in your URLs) as the fingerprint.
        mm = url.split("&refer")[0]
        return mm

    def request_seen(self, request):
        fp = self.__getid(request.url)
        if fp in self.fingerprints:
            return True  # already seen: tell the scheduler to drop it
        self.fingerprints.add(fp)
        if self.file:
            # Persist the fingerprint so it survives across job restarts
            self.file.write(fp + os.linesep)
        return False

Then you need to set the correct DUPEFILTER_CLASS in settings.py:

DUPEFILTER_CLASS = 'scraper.duplicate_filter.CustomFilter'

It should work after that. Note that the dotted path must match your project layout; here the filter lives in duplicate_filter.py inside the scraper package.
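
As a quick sanity check, here is a minimal spider sketch; the spider name and the example.com URLs are placeholders. It issues two requests that differ only in the &refer suffix, so with the filter enabled the scheduler should drop the second one as a duplicate. The duplicates are yielded from a callback rather than listed in start_urls, because Scrapy requests start_urls with dont_filter=True and they would bypass the dupefilter.

import scrapy


class DedupeTestSpider(scrapy.Spider):
    name = "dedupe_test"  # placeholder name
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Both URLs reduce to the same fingerprint once "&refer..." is
        # stripped, so only the first request should actually be crawled.
        yield scrapy.Request(
            "http://example.com/item?id=1&refer=home",
            callback=self.parse_item,
        )
        yield scrapy.Request(
            "http://example.com/item?id=1&refer=footer",
            callback=self.parse_item,
        )

    def parse_item(self, response):
        self.logger.info("Crawled %s", response.url)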
