Ah, here comes yet another discussion about the GIL. Here’s the thing: fetching content with urllib2 is going to be mostly IO-bound. Native threading and multiprocessing will perform about the same when the task is IO-bound, because a thread releases the GIL while it is blocked waiting on the network (threading only becomes a problem when the work is CPU-bound). So yes, you can speed it up. I’ve done it myself using Python threads, with something like 10 downloader threads.
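A quick way to convince yourself of the IO-bound claim: the sketch below stands in for network waits with time.sleep, which releases the GIL much like a blocking socket read does. The function names and the 0.1 s delay are arbitrary choices for the demo, not measurements of any real server.

```python
import threading
import time


def fake_request(delay=0.1):
    # Stand-in for a blocking network call; sleeping releases the GIL.
    time.sleep(delay)


def time_sequential(n):
    # Run n fake requests one after another and return the elapsed seconds.
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start


def time_threaded(n):
    # Run n fake requests in n threads and return the elapsed seconds.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

Calling time_sequential(5) should take about five times the delay, while time_threaded(5) should take about one delay plus thread start-up overhead, since the waits overlap.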
Basically you use a producer-consumer model: one thread (or process) puts URLs to download onto a shared queue, and N threads (or processes) consume from that queue and make the requests to the server.
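Here’s a rough sketch in Python 2 (Downloader is just a placeholder name for whatever class holds the queue):
import urllib2
from Queue import Queue

class Downloader(object):
    def __init__(self):
        # Make sure that the queue is thread-safe! Queue.Queue does its own locking.
        self.queue = Queue()

    def producer(self):
        # Only need one producer, although you could have multiple
        with open('urllist.txt', 'r') as fh:
            for line in fh:
                self.queue.put(line.strip())

    def consumer(self):
        # Fire up N of these babies for some speed
        while True:
            url = self.queue.get()  # blocks until a URL is available
            dh = urllib2.urlopen(url)
            with open('/dev/null', 'w') as fh:  # gotta put it somewhere
                fh.write(dh.read())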
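For anyone who wants to run this end to end, here is a minimal sketch of the wiring the snippet above leaves out: starting N consumer threads and shutting them down cleanly. It assumes Python 3, where urllib2 became urllib.request. The names run_downloads and fetch, the None sentinel, and the choice to collect results in a dict are my own for illustration, not part of the original code.

```python
import queue
import threading
import urllib.request


def fetch(url):
    # Assumed default fetcher: read the whole response body into memory.
    with urllib.request.urlopen(url) as response:
        return response.read()


def run_downloads(urls, num_workers=10, fetch=fetch):
    """Fetch every URL using num_workers consumer threads.

    Returns a dict mapping each url to its body, or to the exception raised.
    """
    work = queue.Queue()  # queue.Queue is thread-safe
    results = {}
    results_lock = threading.Lock()

    def consumer():
        while True:
            url = work.get()
            if url is None:  # sentinel: no more work for this thread
                break
            try:
                body = fetch(url)
            except Exception as exc:  # keep the worker alive on a bad URL
                body = exc
            with results_lock:
                results[url] = body

    threads = [threading.Thread(target=consumer) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # Producer: here it is simply the calling thread feeding the queue.
    for url in urls:
        work.put(url)
    for _ in threads:
        work.put(None)  # one sentinel per consumer

    for t in threads:
        t.join()
    return results
```

Passing fetch as a parameter is only there so the threading logic can be exercised without touching the network, for example run_downloads(["a", "b"], num_workers=2, fetch=lambda u: u.encode()).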
Now if you’re downloading very large chunks of data (hundreds of MB) and a single request completely saturates the bandwidth, then yes, running multiple downloads is pointless. The reason you (generally) run multiple downloads is that the requests are small and each one has relatively high latency / overhead, so most of the time is spent waiting rather than transferring.
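To put rough numbers on that, here is a back-of-the-envelope model. It is a deliberate simplification assumed for illustration: every request pays a fixed latency, all workers share one link of fixed bandwidth, and the two costs simply add. The function name and all the figures are made up.

```python
import math


def estimated_seconds(num_requests, bytes_per_request, workers,
                      latency_s, bandwidth_bytes_per_s):
    # Latency overlaps across workers: each worker handles its share in turn.
    latency_cost = math.ceil(num_requests / workers) * latency_s
    # Transfer time does not shrink: all workers share the same link.
    transfer_cost = num_requests * bytes_per_request / bandwidth_bytes_per_s
    return latency_cost + transfer_cost


LINK = 10 * 1000 * 1000  # assumed link speed: 10 MB/s

# 1000 small pages (20 KB each), 200 ms latency per request
small_1 = estimated_seconds(1000, 20 * 1000, 1, 0.2, LINK)
small_10 = estimated_seconds(1000, 20 * 1000, 10, 0.2, LINK)

# 4 large files (500 MB each), same latency
large_1 = estimated_seconds(4, 500 * 1000 * 1000, 1, 0.2, LINK)
large_4 = estimated_seconds(4, 500 * 1000 * 1000, 4, 0.2, LINK)
```

With these made-up numbers the small-page case drops from roughly 202 s to roughly 22 s with 10 workers, while the large-file case barely moves (about 200.8 s versus 200.2 s), because the link is already the bottleneck.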