Source interface with Python and urllib2

Unfortunately the stack of standard library modules in use (urllib2, httplib, socket) is poorly designed for this purpose: at the key point in the operation, HTTPConnection.connect (in httplib) delegates to socket.create_connection, which in turn gives you no "hook" whatsoever between the creation of the socket instance sock and the sock.connect call, for you …
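One way around the missing hook is to subclass HTTPConnection and reimplement connect() yourself, so you control the gap between socket creation and sock.connect. A minimal sketch (the setsockopt call is just a hypothetical example of something you might do in that gap; the compat import lets it run on Python 3, where httplib is http.client):

```python
try:
    import httplib  # Python 2
except ImportError:
    import http.client as httplib  # Python 3
import socket

class HookedHTTPConnection(httplib.HTTPConnection):
    def connect(self):
        # Replicate the core of socket.create_connection, but with a
        # hook between socket creation and connect.
        err = None
        for res in socket.getaddrinfo(self.host, self.port, 0,
                                      socket.SOCK_STREAM):
            af, socktype, proto, _, sa = res
            try:
                self.sock = socket.socket(af, socktype, proto)
                # --- the hook: the socket exists but is not yet connected ---
                self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
                self.sock.connect(sa)
                return
            except socket.error as e:
                err = e
                if self.sock is not None:
                    self.sock.close()
        if err is not None:
            raise err
```

Any urllib2 handler that uses this connection class then gets the customized socket for free.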

Making HTTP HEAD request with urllib2 from Python 2

This works just fine:

    import urllib2
    request = urllib2.Request('http://localhost:8080')
    request.get_method = lambda: 'HEAD'
    response = urllib2.urlopen(request)
    print response.info()

Tested against a quick-and-dirty HTTPd hacked up in Python:

    Server: BaseHTTP/0.3 Python/2.6.6
    Date: Sun, 12 Dec 2010 11:52:33 GMT
    Content-type: text/html
    X-REQUEST_METHOD: HEAD

I've added a custom header field X-REQUEST_METHOD to show that it works 🙂 Here …
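The same approach can be exercised end-to-end without an external server: spin up a throwaway HTTP server on an ephemeral port and issue the HEAD request against it. A self-contained sketch (the compat imports make it run on Python 3, where urllib2 is urllib.request; the X-REQUEST_METHOD header mirrors the answer's test server):

```python
import threading
try:
    import urllib2  # Python 2
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:
    import urllib.request as urllib2  # Python 3
    from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        # Echo the request method back, like the answer's custom header.
        self.send_header('X-REQUEST_METHOD', self.command)
        self.end_headers()
    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.handle_request).start()

request = urllib2.Request('http://127.0.0.1:%d' % server.server_port)
request.get_method = lambda: 'HEAD'  # the trick: override the method
response = urllib2.urlopen(request)
headers = response.info()
```

Because the method is HEAD, the response carries headers but no body.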

Are urllib2 and httplib thread safe?

httplib and urllib2 are not thread-safe. urllib2 does not serialize access to its global (shared) OpenerDirector object, which is used by urllib2.urlopen(). Similarly, httplib does not serialize access to HTTPConnection objects (e.g. by using a thread-safe connection pool), so sharing HTTPConnection objects between threads is not safe. I suggest using httplib2 or urllib3 …
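If you must stay on the standard library, one safe pattern is to give each thread its own OpenerDirector instead of sharing the module-global one behind urllib2.urlopen(). A minimal sketch (httplib2/urllib3 solve this properly with connection pooling; this just avoids sharing mutable opener state):

```python
import threading
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3

_local = threading.local()

def thread_local_opener():
    # Lazily build one opener per thread; it is never shared across threads.
    if not hasattr(_local, 'opener'):
        _local.opener = urllib2.build_opener()
    return _local.opener
```

Threads then call thread_local_opener().open(url) instead of urllib2.urlopen(url).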

Requests, bind to an ip

Looking into the requests module, it looks like it uses httplib to send the HTTP requests, and httplib uses socket.create_connection() to connect to the www host. Knowing that, and following the monkey-patching method in the link you provided:

    import socket

    real_create_conn = socket.create_connection

    def set_src_addr(*args):
        address, timeout = args[0], args[1]
        source_address = ('IP_ADDR_TO_BIND_TO', 0)
        return …
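A runnable version of that monkey patch, using 127.0.0.1 as a stand-in for IP_ADDR_TO_BIND_TO so it can be exercised locally (port 0 lets the OS pick the source port; create_connection's source_address parameter does the actual binding):

```python
import socket

real_create_conn = socket.create_connection

def set_src_addr(address, *args, **kwargs):
    # Force our source address regardless of what the caller asked for.
    kwargs['source_address'] = ('127.0.0.1', 0)
    if len(args) >= 2:  # a positional source_address was passed: drop it
        args = args[:1]
    return real_create_conn(address, *args, **kwargs)

socket.create_connection = set_src_addr
```

Every library that reaches the network through socket.create_connection (httplib, and therefore requests) now binds its outgoing sockets to that address; restore real_create_conn when you are done.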

Opening websites using urllib2 from behind a corporate firewall – 11004 getaddrinfo failed

If you are using a proxy and that proxy has a username and password (as many corporate proxies do), you need to set up the proxy handler with urllib2:

    proxy_url = 'http://' + proxy_user + ':' + proxy_password + '@' + proxy_ip
    proxy_support = urllib2.ProxyHandler({'http': proxy_url})
    opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
    urllib2.install_opener(opener)

HTTPBasicAuthHandler is used to provide credentials for the site, which you …
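Fleshed out into a self-contained sketch, with placeholder credentials and a hypothetical proxy address (no request is made here; installing the opener simply routes all future urlopen() calls through the authenticated proxy):

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3

proxy_user = 'jdoe'                  # placeholder
proxy_password = 'secret'            # placeholder
proxy_ip = 'proxy.example.com:8080'  # placeholder

# Embed the credentials in the proxy URL, then install a global opener.
proxy_url = 'http://' + proxy_user + ':' + proxy_password + '@' + proxy_ip
proxy_support = urllib2.ProxyHandler({'http': proxy_url})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
```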

urllib2.URLError:

The problem, in my case, was that some installer at some point had defined an http_proxy environment variable on my machine even though I had no proxy. Removing the http_proxy environment variable fixed the problem.
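A quick way to detect and clear such a stray setting from Python itself: getproxies() reports exactly what urllib/urllib2 will pick up from the environment. A small helper sketch (clear_stray_http_proxy is a hypothetical name, not a library function):

```python
import os
try:
    from urllib import getproxies  # Python 2
except ImportError:
    from urllib.request import getproxies  # Python 3

def clear_stray_http_proxy():
    """Report the proxy settings urllib sees, then remove any leftover
    http_proxy environment variable so urlopen() connects directly."""
    before = getproxies()
    for var in ('http_proxy', 'HTTP_PROXY'):
        os.environ.pop(var, None)
    return before
```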

Tell urllib2 to use custom DNS

It looks like name resolution is ultimately handled by socket.create_connection:

    urllib2.urlopen -> httplib.HTTPConnection -> socket.create_connection

Though once the "Host:" header has been set, you can resolve the host yourself and pass the IP address on down to the opener. I'd suggest that you subclass httplib.HTTPConnection and wrap its connect method to modify self.host before passing …
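A sketch of that subclass, where a hypothetical DNS_TABLE dict plays the role of the custom resolver. The Host: header is built from the original name before connect() runs, so swapping self.host only for the duration of the TCP connect keeps virtual hosting intact:

```python
try:
    import httplib  # Python 2
except ImportError:
    import http.client as httplib  # Python 3

DNS_TABLE = {'app.internal': '127.0.0.1'}  # hypothetical custom mapping

class CustomDNSConnection(httplib.HTTPConnection):
    def connect(self):
        original = self.host
        try:
            # Resolve via our table; fall back to normal DNS otherwise.
            self.host = DNS_TABLE.get(self.host, self.host)
            httplib.HTTPConnection.connect(self)
        finally:
            # Restore so later requests still send the original Host: header.
            self.host = original
```

Requests to CustomDNSConnection('app.internal', port) then connect to 127.0.0.1 while the server still sees Host: app.internal.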

How to download any(!) webpage with correct charset in python?

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

    fp = urllib2.urlopen(request)
    charset = fp.headers.getparam('charset')

You can use BeautifulSoup to locate a meta element in the HTML:

    soup = BeautifulSoup.BeautifulSoup(data)
    meta = soup.findAll('meta', {'http-equiv': lambda v: v.lower() == 'content-type'})

If neither is available, browsers typically fall back to user …
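The same fallback chain can be sketched with the standard library alone (no BeautifulSoup): prefer the HTTP charset header, then scan the HTML for a charset-bearing meta tag, then give up to a default. The sniff_charset helper and the ISO-8859-1 default are assumptions for illustration, not part of the original answer:

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3
import re

class MetaCharsetParser(HTMLParser):
    """Pull a charset out of <meta charset=...> or the http-equiv form."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.charset = None
    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attrs = dict(attrs)
        if attrs.get('charset'):  # HTML5 form
            self.charset = attrs['charset']
        elif attrs.get('http-equiv', '').lower() == 'content-type':
            m = re.search(r'charset=([\w-]+)', attrs.get('content', ''), re.I)
            if m:
                self.charset = m.group(1)

def sniff_charset(header_charset, html, default='ISO-8859-1'):
    # 1. HTTP header wins; 2. meta tag in the document; 3. default.
    if header_charset:
        return header_charset
    parser = MetaCharsetParser()
    parser.feed(html)
    return parser.charset or default
```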