How to handle IncompleteRead: in python

The link you included in your question is simply a wrapper that executes urllib's read() function and catches any incomplete-read exceptions for you. If you don't want to implement this entire patch, you could always just wrap the places where you read your links in a try/except block. For example: try: page = urllib2.urlopen(urls).read() except httplib.IncompleteRead, … Read more
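A minimal Python 2 sketch of that try/except approach, assuming you simply want to retry a few times and, failing that, keep whatever partial body arrived (httplib.IncompleteRead exposes the bytes received so far on its .partial attribute); the URL and retry count are placeholders:

import urllib2
import httplib

def fetch(url, retries=3):
    # Try a few times; on the last failure, fall back to the partial body.
    for attempt in range(retries):
        try:
            return urllib2.urlopen(url).read()
        except httplib.IncompleteRead, e:
            # e.partial holds the bytes received before the connection dropped
            if attempt == retries - 1:
                return e.partial
    return None

page = fetch('http://example.com/some-page')  # placeholder URL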

Scrape the absolute URL instead of a relative path in python

urllib.parse.urljoin() might help. It does a join, but it is smart about it and handles both relative and absolute paths. Note this is Python 3 code.

>>> import urllib.parse
>>> base = 'https://www.example-page-xl.com'
>>> urllib.parse.urljoin(base, '/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
>>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
'https://www.example-page-xl.com/helloworld/index.php'
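In a scraper you would typically resolve every href you extract against the URL of the page it came from; a small Python 3 sketch of that, where the page URL and the href list are made-up placeholders:

from urllib.parse import urljoin

page_url = 'https://www.example-page-xl.com/section/list.php'   # placeholder page URL
hrefs = ['/helloworld/index.php',                 # relative to the site root
         'details.php?id=7',                      # relative to the current directory
         'https://www.example-page-xl.com/faq']   # already absolute, left unchanged

absolute_links = [urljoin(page_url, href) for href in hrefs]
for link in absolute_links:
    print(link)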

How do I parse an HTML table with Nokogiri?

#!/usr/bin/ruby1.8

require 'nokogiri'
require 'pp'

html = <<-EOS
(The HTML from the question goes here)
EOS

doc = Nokogiri::HTML(html)

rows = doc.xpath('//table/tbody[@id="threadbits_forum_251"]/tr')
details = rows.collect do |row|
  detail = {}
  [
    [:title,  'td[3]/div[1]/a/text()'],
    [:name,   'td[3]/div[2]/span/a/text()'],
    [:date,   'td[4]/text()'],
    [:time,   'td[4]/span/text()'],
    [:number, 'td[5]/a/text()'],
    [:views,  'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

pp details

… Read more

mechanize python click a button

Clicking a type="button" in a pure HTML form does nothing. For it to do anything, there must be JavaScript involved, and mechanize doesn't run JavaScript. So your options are: (1) read the JavaScript yourself and simulate with mechanize what it would be doing, or (2) use Spidermonkey to run the JavaScript code. I'd do the first one, since … Read more
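A sketch of the first option in Python 2 with mechanize, assuming that reading the page's JavaScript shows the button simply POSTs a couple of fields to some endpoint; the URLs and parameters below are invented for illustration and should be taken from the actual script you read:

import urllib
import mechanize

br = mechanize.Browser()
br.open('http://example.com/page-with-button')  # placeholder page URL

# Reproduce the request the button's onclick handler would have sent.
# Endpoint and parameters are hypothetical.
data = urllib.urlencode({'action': 'do_thing', 'item_id': '42'})
response = br.open('http://example.com/ajax/endpoint', data)  # POST, because data is supplied
print response.read()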

Using Python and Mechanize to submit form data and authenticate

I would definitely suggest trying to use the API if possible, but this works for me (not for your example post, which has been deleted, but for any active one):

#!/usr/bin/env python
import mechanize
import cookielib
import urllib
import logging
import sys

def main():
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    … Read more
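The part the excerpt cuts off usually amounts to selecting the login form, filling in the credential fields, and submitting; a Python 2 sketch of that pattern, with the URL, form index, and field names as placeholders (inspect the real form with br.forms() first):

import mechanize
import cookielib

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.addheaders = [('User-Agent', 'Mozilla/5.0')]

br.open('http://example.com/login')   # placeholder login page
br.select_form(nr=0)                  # placeholder: pick the right form index or name
br['username'] = 'my_user'            # placeholder field names and credentials
br['password'] = 'my_pass'
response = br.submit()

print response.geturl()               # where we landed after logging in
print cj                              # cookies collected during authentication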

What should I do if socket.setdefaulttimeout() is not working?

While socket.setdefaulttimeout() will set the default timeout for new sockets, if you're not using the sockets directly the setting can easily be overridden. In particular, if the library calls socket.setblocking() on its socket, it'll reset the timeout. urllib2.urlopen() has a timeout argument; however, there is no timeout in urllib2.Request. As you're using mechanize, you should … Read more
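A short Python 2 sketch of that point: set the global default if you like, but pass an explicit per-request timeout where you can, since a library calling setblocking() would wipe out the default; the URL and timeout values are placeholders:

import socket
import urllib2

socket.setdefaulttimeout(10)   # default for sockets the libraries create...

# ...but an explicit per-request timeout is more reliable.
try:
    page = urllib2.urlopen('http://example.com/slow-page', timeout=5).read()  # placeholder URL
except socket.timeout:
    print 'request timed out'
except urllib2.URLError, e:
    # urllib2 sometimes wraps the timeout in a URLError whose reason is a socket.timeout
    print 'failed:', e.reason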