web-scraping - w3toppers.com

Capture AJAX response with selenium and python

I once intercepted some ajax calls injecting javascript to the page using selenium. The bad side of the history is that selenium could sometimes be, let’s say “fragile”. So for no reason I got selenium exceptions while doing this injection. Anyway, my idea was intercept the XHR calls, and set its response to a new … Read more

How to handle IncompleteRead: in python

The link you included in your question is simply a wrapper that executes urllib’s read() function, which catches any incomplete read exceptions for you. If you don’t want to implement this entire patch, you could always just throw in a try/catch loop where you read your links. For example: try: page = urllib2.urlopen(urls).read() except httplib.IncompleteRead, … Read more

find a word on a website and get its page link

Main problem is wrong allowed_domain – it has to be without path / allowed_domains = [“www.reichelt.com”] Other problems can be this tutorial is 3 years old (there is link to documentation for Scarpy 1.5 but newest version is 2.5.0). It also uses some useless lines of code. It gets contenttype but never use it to … Read more

How to scrape dynamic webpages by Python

you can use selenium like below sample: from selenium import webdriver driver = webdriver.Firefox() driver.get(‘http://example.com’) element = driver.find_element_by_class_name(“yourClassName”) #or find by text or etc element.click()

Download all pdf files from a website using Python

Check out the following implementation. I’ve used requests module instead of urllib to do the download. Moreover, I’ve used .select() method instead of .find_all() to avoid using re. import os import requests from urllib.parse import urljoin from bs4 import BeautifulSoup url = “http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html” #If there is no such folder, the script will create one automatically … Read more

Beautiful Soup and Table Scraping – lxml vs html parser

Short answer. If you already installed lxml, just use it. html.parser – BeautifulSoup(markup, “html.parser”) Advantages: Batteries included, Decent speed, Lenient (as of Python 2.7.3 and 3.2.) Disadvantages: Not very lenient (before Python 2.7.3 or 3.2.2) lxml – BeautifulSoup(markup, “lxml”) Advantages: Very fast, Lenient Disadvantages: External C dependency html5lib – BeautifulSoup(markup, “html5lib”) Advantages: Extremely lenient, Parses … Read more

How can I scrape a page with dynamic content (created by JavaScript) in Python?

EDIT Sept 2021: phantomjs isn’t maintained any more, either EDIT 30/Dec/2017: This answer appears in top results of Google searches, so I decided to update it. The old answer is still at the end. dryscape isn’t maintained anymore and the library dryscape developers recommend is Python 2 only. I have found using Selenium’s python library … Read more

XHR request URL says does not exist when attempting to parse it’s content

You have several problems: the url should be http://www.whoscored.com/stageplayerstatfeed wrong GET parameters missing important required headers you need response.json(), not response.body The fixed version: import requests url=”http://www.whoscored.com/stageplayerstatfeed” params = { ‘field’: ‘1’, ‘isAscending’: ‘false’, ‘orderBy’: ‘Rating’, ‘playerId’: ‘-1’, ‘stageId’: ‘9155’, ‘teamId’: ’32’ } headers = {‘User-Agent’: ‘Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, … Read more

post request using python to asp.net page

Where did you get the value viewstate and eventvalidation? On one hand, they shouldn’t end with “…”, you must have omitted something. On the other hand, they shouldn’t be hard-coded. One solution is like this: Retrieve the page via URL “http://www.indiapost.gov.in/pin/” without any form data Parse and retrieve the form values like __VIEWSTATE and __EVENTVALIDATION … Read more

Scraping Javascript driven web pages with PyQt4 – how to access pages that need authentication?

I figured it out. Here’s what I ended up with in case it can help someone else. #!/usr/bin/python # -*- coding: latin-1 -*- import sys import base64 from PyQt4.QtGui import * from PyQt4.QtCore import * from PyQt4.QtWebKit import * from PyQt4 import QtNetwork class Render(QWebPage): def __init__(self, url): self.app = QApplication(sys.argv) username=”username” password = ‘password’ … Read more