How to programmatically fill input elements built with React?

This accepted solution appears not to work in React > 15.6 (including React 16) as a result of changes to de-dupe input and change events. You can see the React discussion here: https://github.com/facebook/react/issues/10135 And the suggested workaround here: https://github.com/facebook/react/issues/10135#issuecomment-314441175 Reproduced here for convenience: Instead of input.value="foo"; input.dispatchEvent(new Event('input', {bubbles: true})); you would use function setNativeValue(element, … Read more
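If you are driving the page from Python with Selenium (an assumption; the original answer is plain JavaScript), the same native-setter workaround can be injected through execute_script. The URL, selector, and value below are placeholders, and the injected script is a simplified form of the setNativeValue helper from the linked issue comment:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder page with a React-controlled input

element = driver.find_element(By.CSS_SELECTOR, "input[name='q']")  # placeholder selector

# Set the value through the native HTMLInputElement setter so React's internal
# value tracker notices the change, then dispatch a bubbling 'input' event so
# the component's onChange handler fires.
driver.execute_script(
    """
    const element = arguments[0];
    const value = arguments[1];
    const setter = Object.getOwnPropertyDescriptor(
        window.HTMLInputElement.prototype, 'value'
    ).set;
    setter.call(element, value);
    element.dispatchEvent(new Event('input', { bubbles: true }));
    """,
    element,
    "foo",
)
```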

Scrapy CrawlSpider doesn’t crawl the first landing page

Just change your callback to parse_start_url and override it: from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor class DownloadSpider(CrawlSpider): name="downloader" allowed_domains = ['bnt-chemicals.de'] start_urls = [ "http://www.bnt-chemicals.de", ] rules = ( Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True), ) fname = 0 def parse_start_url(self, response): self.fname += 1 fname = "%s.txt" % self.fname with open(fname, 'w') as f: f.write('%s, %s\n' … Read more
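The excerpt uses the old scrapy.contrib and SgmlLinkExtractor paths, which were removed in later Scrapy releases. A minimal sketch of the same idea against current module paths (the logging/yield body is an assumption, since the original file-writing code is truncated above):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DownloadSpider(CrawlSpider):
    name = "downloader"
    allowed_domains = ["bnt-chemicals.de"]
    start_urls = ["http://www.bnt-chemicals.de"]

    rules = (
        # Pointing the rule's callback at parse_start_url means the crawled
        # pages and the landing page all go through the same method.
        Rule(LinkExtractor(allow="prod"), callback="parse_start_url", follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider calls this hook for the start URLs as well, so the first
        # landing page is no longer skipped.
        self.logger.info("Parsed %s", response.url)
        yield {"url": response.url}
```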

Crawling the Google Play store

First of all, Google Play's robots.txt does NOT disallow pages under "/store/apps". If you want to crawl Google Play you would need to develop your own web crawler, parse the HTML pages and extract the app metadata you need (e.g. title, description, price). This topic has been covered in this other question. … Read more
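You can check the robots.txt claim yourself with the standard library; a minimal sketch (the app id in the URL is just a placeholder):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://play.google.com/robots.txt")
rp.read()

url = "https://play.google.com/store/apps/details?id=com.example.app"
# True means a generic crawler ("*") is not disallowed from fetching this path.
print(rp.can_fetch("*", url))
```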

Creating a generic scrapy spider

You could create a run-time spider which is evaluated by the interpreter. This code piece could be evaluated at runtime like so: a = open("test.py") from compiler import compile d = compile(a.read(), 'spider.py', 'exec') eval(d) MySpider <class '__main__.MySpider'> print MySpider.start_urls ['http://www.somedomain.com']
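Note that the compiler module shown above is Python 2 only and no longer exists in Python 3. A minimal sketch of the same runtime-evaluation idea using the built-in compile() and exec() (the test.py file defining MySpider is assumed, as in the original):

```python
# test.py is expected to contain a spider definition such as:
#     import scrapy
#     class MySpider(scrapy.Spider):
#         name = "myspider"
#         start_urls = ["http://www.somedomain.com"]
with open("test.py") as f:
    source = f.read()

code = compile(source, "spider.py", "exec")

namespace = {}
exec(code, namespace)  # evaluate the spider definition at runtime

MySpider = namespace["MySpider"]
print(MySpider.start_urls)  # ['http://www.somedomain.com']
```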

scrapy- how to stop Redirect (302)

Yes, you can do this simply by adding meta values like meta={'dont_redirect': True}. You can also stop redirects for a particular response code, like meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, which will stop redirecting only 302 response codes. You can add as many HTTP status codes as you want to avoid redirecting them. Example: yield Request('some url', meta = … Read more
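A minimal sketch of those meta keys inside a spider callback (the spider, URLs and second callback are placeholders, not part of the original answer):

```python
import scrapy


class NoRedirectSpider(scrapy.Spider):
    name = "no_redirect"
    start_urls = ["https://example.com"]

    def parse(self, response):
        yield scrapy.Request(
            "https://example.com/some-url",
            meta={
                "dont_redirect": True,            # RedirectMiddleware will not follow the redirect
                "handle_httpstatus_list": [302],  # let 302 responses reach the callback
            },
            callback=self.parse_no_redirect,
        )

    def parse_no_redirect(self, response):
        # With the meta keys above, a 302 response is delivered here as-is.
        self.logger.info("Got %s for %s", response.status, response.url)
```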

Python: Disable images in Selenium Google ChromeDriver

Here is another way to disable images: from selenium import webdriver chrome_options = webdriver.ChromeOptions() prefs = {"profile.managed_default_content_settings.images": 2} chrome_options.add_experimental_option("prefs", prefs) driver = webdriver.Chrome(chrome_options=chrome_options) I found it below: http://nullege.com/codes/show/src@o@s@[email protected]/56/selenium.webdriver.ChromeOptions.add_experimental_option
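In current Selenium releases the chrome_options keyword argument has been replaced by options; a minimal sketch of the same preference under the newer API (the target URL is a placeholder):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option(
    "prefs",
    {"profile.managed_default_content_settings.images": 2},  # 2 = block images
)

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
```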