find a word on a website and get its page link

The main problem is the wrong allowed_domains – it has to be the bare domain, without the path /:

    allowed_domains = ["www.reichelt.com"]

Another problem may be that this tutorial is 3 years old (it links to the documentation for Scrapy 1.5, but the newest version is 2.5.0).

It also contains some useless lines of code.

It gets the content type but never uses it to decode response.body. Your URL uses iso-8859-1 for the original language and utf-8 for ?LANGUAGE=PL – but you can simply use response.text and it will decode automatically.
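For illustration, here is a minimal stdlib-only sketch of what that automatic decoding roughly does – read the charset from the Content-Type header and decode the raw body with it (the header and body values are made up):

```python
# Hypothetical example: decode a response body using the charset
# declared in the Content-Type header (roughly what Scrapy's
# response.text does for you automatically).
raw_body = 'Müller'.encode('iso-8859-1')        # bytes as sent by the server
content_type = 'text/html; charset=iso-8859-1'  # header from the server

charset = 'utf-8'  # fallback if no charset parameter is present
for part in content_type.split(';'):
    part = part.strip()
    if part.lower().startswith('charset='):
        charset = part.split('=', 1)[1]

text = raw_body.decode(charset)
print(text)  # Müller
```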

It also sets ok = False and checks it later, but that is completely useless.


Minimal working code – you can copy it to a single file and run it as python script.py without creating a project.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    return [match.start() for match in re.finditer(re.escape(sub), string)]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    #start_urls = ["https://www.reichelt.com/?LANGUAGE=PL"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    #crawl_count = 0
    #words_found = 0                                 

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        #self.crawl_count += 1

        #content_type = response.headers.get("content-type", "").decode('utf-8').lower()
        #print('content_type:', content_type)
        #data = response.body.decode('utf-8')
        
        data = response.text

        for word in wordlist:
            print('[check_buzzwords] check word:', word)
            substrings = find_all_substrings(data, word)
            print('[check_buzzwords] substrings:', substrings)
            
            for pos in substrings:
                #self.words_found += 1
                # only display
                print('[check_buzzwords] word: {} | pos: {} | sub: {} | url: {}'.format(word, pos, data[max(0, pos-20):pos+20], response.url))
                # send to file
                yield {'word': word, 'pos': pos, 'sub': data[max(0, pos-20):pos+20], 'url': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start() 

EDIT:

I added data[pos-20:pos+20] to the yielded data to see where the substring is, and sometimes it is in a URL like .../elements/adw_2018/catalog/... or in another place like <img alt="catalog" – so using a regex may not be a good idea. It may be better to use an xpath or css selector to search the text only in some places, or only in links.


EDIT:

A version which searches for links containing words from the list. It uses response.xpath to find all links and then checks if the word is in the href – so it doesn't need a regex.

One problem is that it treats a link with -downloads- (with an s) as a link containing both the word download and the word downloads, so it would need a more complex check (i.e. using a regex) to treat it only as a link with the word downloads.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

class WebsiteSpider(CrawlSpider):

    name = "webcrawler"
    
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    
    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        
        links = response.xpath('//a[@href]')
        
        for word in wordlist:
            
            for link in links:
                url = link.attrib.get('href')
                if word in url:
                    print('[check_buzzwords] word: {} | url: {} | page: {}'.format(word, url, response.url))
                    # send to file
                    yield {'word': word, 'url': url, 'page': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start() 
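The check above could be tightened with a regex that requires the word to stand alone, so -downloads- counts as downloads but not as download. A sketch (exact_words is a hypothetical helper, assuming lowercase URLs):

```python
import re

wordlist = ["katalog", "catalog", "downloads", "download"]

def exact_words(url, words):
    # hypothetical helper: keep only words that appear in url
    # as whole tokens, not as parts of longer words
    found = []
    for word in words:
        # require a non-letter (or the string edge) on both sides
        if re.search(r'(?<![a-z]){}(?![a-z])'.format(re.escape(word)), url):
            found.append(word)
    return found

print(exact_words('/elements/adw_2018/-downloads-/index.html', wordlist))
# ['downloads']
```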
