screen-scraping - w3toppers.com

What’s a good tool to screen-scrape with Javascript support? [closed]

You could use Selenium or Watir to drive a real browser. There are also some JavaScript-based headless browsers: PhantomJS is a headless WebKit browser. pjscrape is a scraping framework based on PhantomJS and jQuery. CasperJS is a navigation scripting & testing utility based on PhantomJS, if you need to do a little more than point …

How do I prevent site scraping? [closed]

Note: Since the complete version of this answer exceeds Stack Overflow’s length limit, you’ll need to head to GitHub to read the extended version, with more tips and details. In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers …

Executing Javascript from Python

You can also use Js2Py, which is written in pure Python and can both execute JavaScript and translate it to Python. It supports virtually all of JavaScript, including labels, getters, setters, and other rarely used features.

    import js2py

    js = """
    function escramble_758(){
    var a,b,c
    a="+1 "
    b='84-'
    a+='425-'
    b+='7450'
    c="9"
    document.write(a+c+b)
    }
    escramble_758()
    """.replace("document.write", "return …

Is there a PHP equivalent of Perl’s WWW::Mechanize?

SimpleTest’s ScriptableBrowser can be used independently of the testing framework. I’ve used it for numerous automation jobs.

Headless Browser for Python (Javascript support REQUIRED!) [closed]

I use WebKit as a headless browser in Python via PyQt / PySide:

http://www.riverbankcomputing.co.uk/software/pyqt/download
http://developer.qt.nokia.com/wiki/Category:LanguageBindings::PySide::Downloads

I particularly like WebKit because it is simple to set up. On Ubuntu you just run:

    sudo apt-get install python-qt4

Here is an example script: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

How to implement a web scraper in PHP? [closed]

Scraping generally involves three steps: first, you GET or POST your request to a specified URL; next, you receive the HTML that is returned as the response; finally, you parse out of that HTML the text you’d like to scrape. To accomplish steps 1 and 2, below is a simple PHP class which uses Curl …
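The three steps can also be sketched in Python using only the standard library. This is not the PHP class from the answer, just a minimal illustration; the names TitleParser, fetch_html, and extract_title are made up for this example, and pulling out the page title stands in for "the text you'd like to scrape".

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url, timeout=10):
    """Steps 1 and 2: send a GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Step 3: parse the HTML and pull out the text of interest."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

With network access, extract_title(fetch_html("http://example.com")) would run all three steps in sequence. Keeping the fetch and the parse in separate functions makes the parsing step easy to try on saved HTML.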

jsoup posting and cookie

When you log in to the site, it is probably setting an authorised session cookie that needs to be sent on subsequent requests to maintain the session. You can get the cookie like this:

    Connection.Response res = Jsoup.connect("http://www.example.com/login.php")
        .data("username", "myUsername", "password", "myPassword")
        .method(Method.POST)
        .execute();

    Document doc = res.parse();
    String sessionId = res.cookie("SESSIONID"); // you will need …
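The same idea (log in, read the session cookie from the response, send it back on later requests) can be sketched in Python with the standard library. This is an illustration rather than a translation of the jsoup code: the helper names are hypothetical, and the cookie name SESSIONID and the form field names are simply carried over from the example above and will differ per site.

```python
import urllib.parse
import urllib.request
from http.cookies import SimpleCookie


def build_login_request(url, username, password):
    """Build a POST request carrying the login form fields."""
    data = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("ascii")
    return urllib.request.Request(url, data=data, method="POST")


def session_id_from_headers(set_cookie_headers, name="SESSIONID"):
    """Return the value of the named cookie from Set-Cookie header values, or None."""
    for header in set_cookie_headers:
        cookie = SimpleCookie()
        cookie.load(header)
        if name in cookie:
            return cookie[name].value
    return None


def build_authenticated_request(url, session_id, name="SESSIONID"):
    """Build a follow-up request that sends the session cookie back."""
    return urllib.request.Request(url, headers={"Cookie": f"{name}={session_id}"})
```

In real use, the login request would be sent with urllib.request.urlopen, and the Set-Cookie values would come from response.headers.get_all("Set-Cookie") or []. For anything beyond a single cookie, http.cookiejar.CookieJar combined with urllib.request.HTTPCookieProcessor tracks cookies automatically.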

scrape html generated by javascript with python

In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice. You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically …

Web scraping with Python [closed]

Use urllib2 in combination with the brilliant BeautifulSoup library:

    import urllib2
    from BeautifulSoup import BeautifulSoup
    # or if you're using BeautifulSoup4:
    # from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

    for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
        tds = row('td')
        print tds[0].string, tds[1].string
        # will print date and sunrise
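The snippet above is Python 2 (urllib2 and the print statement do not exist in Python 3). As a rough Python 3 counterpart, here is a sketch that swaps BeautifulSoup for the standard library's html.parser, so it is a different technique and not the answer's own code. The names TableRowParser and table_rows are invented for the example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the <td> text of each row inside a table with the given class."""

    def __init__(self, table_class="spad"):
        super().__init__()
        self.table_class = table_class
        self.rows = []
        self._in_table = False
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.table_class in classes:
                self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._row is not None and tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that had no <td> cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._in_table = False


def table_rows(html, table_class="spad"):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRowParser(table_class)
    parser.feed(html)
    return parser.rows
```

Given the page's HTML as a string (for instance from urllib.request.urlopen(...).read().decode()), looping over table_rows(html) and printing row[0] and row[1] would mirror the date and sunrise output of the original loop.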

PhantomJS failing to open HTTPS site

I tried Fred’s and Cameron Tinker’s answers, but only the --ssl-protocol=any option seemed to help me:

    phantomjs --ssl-protocol=any test.js

Also, I think it should be much safer to use --ssl-protocol=any, as you are still using encryption, whereas --ignore-ssl-errors=true will ignore (duh) all SSL errors, including malicious ones.