Windows memory limitation
Memory errors happens a lot with python when using the 32bit version in Windows. This is because 32bit processes only gets 2GB of memory to play with by default.
Tricks for lowering memory usage
If you are not using 32bit python in windows but are looking to improve on your memory efficiency while reading csv files, there is a trick.
The pandas.read_csv function takes an option called dtype
. This lets pandas know what types exist inside your csv data.
How this works
By default, pandas will try to guess what dtypes your csv file has. This is a very heavy operation because while it is determining the dtype, it has to keep all raw data as objects (strings) in memory.
Example
Let’s say your csv looks like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
This example is of course no problem to read into memory, but it’s just an example.
If pandas were to read the above csv file without any dtype option, the age would be stored as strings in memory until pandas has read enough lines of the csv file to make a qualified guess.
I think the default in pandas is to read 1,000,000 rows before guessing the dtype.
Solution
By specifying dtype={'age':int}
as an option to the .read_csv()
will let pandas know that age should be interpreted as a number. This saves you lots of memory.
Problem with corrupt data
However, if your csv file would be corrupted, like this:
name, age, birthday
Alice, 30, 1985-01-01
Bob, 35, 1980-01-01
Charlie, 25, 1990-01-01
Dennis, 40+, None-Ur-Bz
Then specifying dtype={'age':int}
will break the .read_csv()
command, because it cannot cast "40+"
to int. So sanitize your data carefully!
Here you can see how the memory usage of a pandas dataframe is a lot higher when floats are kept as strings:
Try it yourself
df = pd.DataFrame(pd.np.random.choice(['1.0', '0.6666667', '150000.1'],(100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 224544 (~224 MB)
df = pd.DataFrame(pd.np.random.choice([1.0, 0.6666667, 150000.1],(100000, 10)))
resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
# 79560 (~79 MB)