Most efficient way to parse a large .csv in Python?

As others have already pointed out, the first two methods do no actual string parsing; they just read a line at a time without extracting any fields. I imagine most of the speed difference seen with the csv module is due to that.

The CSV module is invaluable if your data contains any text that uses more of the ‘standard’ CSV syntax than plain commas (quoted fields, commas inside quotes, escaped quotes), especially if the file came out of Excel.

If you’ve just got lines like “1,2,3,4” you’re probably fine with a simple split, but if you have lines like "1,2,'Hello, my name\'s fred'" you’re going to go crazy trying to parse that without errors.
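To make the difference concrete, here is a small sketch that runs both approaches on the quoted line above. The quotechar and escapechar arguments are my assumption about how that sample is meant to be read (single quotes, backslash escape); the module's default dialect expects double quotes instead.

```python
import csv

# The sample line from above: a single-quoted field containing a comma
# and a backslash-escaped quote.
line = r"1,2,'Hello, my name\'s fred'"

# Naive approach: split on every comma, including the one inside the quotes.
naive_fields = line.split(",")

# csv.reader accepts any iterable of strings, so a one-item list works.
# quotechar/escapechar are set only because this sample uses single quotes
# and a backslash escape (an assumption about the intended format).
parsed_fields = next(csv.reader([line], quotechar="'", escapechar="\\"))

print(len(naive_fields))  # the quoted field has been cut in two
print(parsed_fields)      # three fields, quotes and escape removed
```

The split version hands back four pieces with stray quote characters still attached, while the reader returns the three fields you actually wanted.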

CSV will also transparently handle things like newlines in the middle of a quoted string.
A simple for..in without CSV is going to have trouble with that.
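A quick way to see this, using made-up sample data held in memory rather than a real file:

```python
import csv
import io

# Made-up sample: a header plus two records, the first of which has a
# line break inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Reading line by line sees four physical lines.
physical_lines = data.splitlines()

# csv.reader keeps reading until the quoted field is closed, so it
# returns three logical rows.
rows = list(csv.reader(io.StringIO(data)))

print(len(physical_lines))
print(rows[1])  # the embedded newline is preserved inside the field
```

When reading from a real file in Python 3, the csv documentation recommends opening it with newline='' so that embedded line breaks reach the reader untouched.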

The CSV module has always worked fine for me reading unicode strings if I use it like so:

f = csv.reader(codecs.open(filename, 'rU'))
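That line is Python 2 style: the 'U' in 'rU' is the old universal-newlines flag, which was removed from the built-in open() in Python 3.11. A rough Python 3 equivalent might look like the sketch below. The helper name read_rows and the UTF-8 default are my own assumptions, not part of the original snippet; use whatever encoding your file actually has.

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Yield each CSV row as a list of strings (hypothetical helper)."""
    # newline="" leaves line endings alone so the csv reader can deal with
    # line breaks inside quoted fields itself.
    # encoding defaults to UTF-8 here, which is an assumption.
    with open(path, newline="", encoding=encoding) as f:
        yield from csv.reader(f)
```

Because it is a generator, you can loop over a large file with for row in read_rows(filename) without loading it all into memory. Rows with fields missing at the end simply come back as shorter lists.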

It is plenty robust for importing multi-thousand-line files with unicode, quoted strings, newlines in the middle of quoted strings, lines with fields missing at the end, etc., all with reasonable read times.

I’d try it first and only look for optimizations on top of it if you really need the extra speed.
