What’s the most efficient way to convert a MySQL result set to a NumPy array?

This solution uses Kieth’s fromiter technique, but handles the two dimensional table structure of SQL results more intuitively. Also, it improves on Doug’s method by avoiding all the reshaping and flattening in python data types. Using a structured array we can read pretty much directly from the MySQL result into numpy, cutting out python data types almost entirely. I say ‘almost’ because the fetchall iterator still produces python tuples.

There is one caveat though, but it’s not a biggie. You must know the data type of your columns and the number of rows in advance.

Knowing the column types should be obvious, since you know what the query is presumably, otherwise you can always use curs.description, and a map of the MySQLdb.FIELD_TYPE.* constants.

Knowing the row count means you have to use client side cursor (which is the default). I don’t know enough about the internals of MySQLdb and the MySQL client libraries, but my understanding is that the entire result is fetched into client side memory when using client side cursors, although I suspect there’s actually some buffering and caching involved. This would mean using double memory for the result, once for the cursor copy and once for the array copy, so it’s probably a good idea to close the cursor as soon as possible to free up the memory if the result set is large.

Strictly speaking, you don’t have to provide the number of rows in advance, but doing so means the array memory is allocated once off in advance, and not continuously resized as more rows come in from the iterator which is meant to provide a huge performance boost.

And with that, some code

import MySQLdb
import numpy

conn = MySQLdb.connect(host="localhost", user="bob", passwd='mypasswd', db='bigdb')
curs = conn.cursor() #Use a client side cursor so you can access curs.rowcount
numrows = curs.execute("SELECT id, rating FROM video")

#curs.fetchall() is the iterator as per Kieth's answer
#count=numrows means advance allocation
#dtype="i4,i4" means two columns, both 4 byte (32 bit) integers
A = numpy.fromiter(curs.fetchall(), count=numrows, dtype=('i4,i4'))

print A #output entire array
ids = A['f0'] #ids = an array of the first column
              #(strictly speaking it's a field not column)
ratings = A['f1'] #ratings is an array of the second colum

See the numpy documentation for dtype and the link above about structured arrays for how to specify column data types, and column names.

Leave a Comment