Saving to hdf5 is very slow (Python freezing)

Writing Data to HDF5

If you write to a chunked dataset without specifying a chunk shape, h5py will choose one automatically for you. Since h5py can't know how you want to read or write the data, this often ends up in bad performance.
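For example, if you mostly read or write whole rows along the first axis, it usually pays off to pass an explicit chunks= argument instead of relying on auto-chunking. A minimal sketch (file name, dataset names and shapes are just placeholders):

import numpy as np
import h5py as h5

with h5.File("example.h5", "w") as f:
    # Let h5py guess a chunk shape (often a poor fit for your access pattern)
    auto = f.create_dataset("auto_chunked", shape=(10000, 128), dtype=np.float32, chunks=True)
    # Choose the chunk shape yourself so one row-wise write covers whole chunks
    manual = f.create_dataset("row_chunked", shape=(10000, 128), dtype=np.float32, chunks=(100, 128))
    print(auto.chunks, manual.chunks)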

You are also using the default chunk-cache-size of 1 MB. If you only write to a part of a chunk and the chunk doesn't fit in the cache (which is very likely with a 1 MB chunk-cache-size), the whole chunk will be read into memory, modified and written back to disk. If that happens multiple times, you will see performance far below the sequential IO speed of your HDD/SSD.
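As far as I know, recent h5py versions (2.9+) also accept the cache settings directly in h5py.File (rdcc_nbytes, rdcc_w0, rdcc_nslots), so the h5py_cache package used below is one option, not the only one. A short sketch:

import h5py as h5

# Raw data chunk cache of ~200 MB; rdcc_w0=1 prefers evicting fully read/written chunks
f = h5.File("Test.h5", "w", rdcc_nbytes=1024**2 * 200, rdcc_w0=1)
# ... create datasets and write as usual ...
f.close()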

In the following example I assume that you only read or write along the first dimension. If not, this has to be adapted to your access pattern.

import numpy as np
import tables  # imported only to register the blosc filter plugin
import h5py as h5
import h5py_cache as h5c
import time

batch_size=120
train_shape=(90827, 10, 10, 2048)
hdf5_path="Test.h5"
# As we are writing whole chunks here, a large chunk cache isn't strictly needed.
# But if you read or write only parts of chunks without a large enough chunk-cache-size,
# performance will be extremely bad (chunks can only be read or written as a whole).
f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2*200)  # 200 MB cache size
dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape, dtype=np.float32,
                                     chunks=(10, 10, 10, 2048),
                                     compression=32001,  # blosc filter, registered via the tables import
                                     compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False)
prediction=np.array(np.arange(120*10*10*2048),np.float32).reshape(120,10,10,2048)
t1 = time.time()
# Testing with ~2 GB of data (20 batches of 120*10*10*2048 float32 values)
for i in range(20):
    #prediction=np.array(np.arange(120*10*10*2048),np.float32).reshape(120,10,10,2048)
    dset_train_bottle[i*batch_size:(i+1)*batch_size, :, :, :] = prediction

f.close()
elapsed = time.time() - t1
print(elapsed)
print("MB/s: " + str(2000/elapsed))

Edit
The data creation in the loop took quite a lot of time, so I now create the data before starting the time measurement.

This should give at least 900 MB/s throughput (CPU limited). With real data and lower compression ratios, you should easily reach the sequential IO speed of your hard disk.

Opening an HDF5 file with the with statement can also lead to bad performance if you make the mistake of calling this block multiple times. Each call closes and reopens the file, discarding the chunk cache.
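A sketch of what that anti-pattern looks like, reusing hdf5_path, batch_size and prediction from the example above (num_batches is just a placeholder):

num_batches = 20

# Slow: every iteration closes and reopens the file, so the chunk cache starts empty each time
for i in range(num_batches):
    with h5.File(hdf5_path, 'a') as f:
        f["train_bottle"][i*batch_size:(i+1)*batch_size, :, :, :] = prediction

# Better: open once, write all batches, close once, keeping the chunk cache alive
with h5.File(hdf5_path, 'a') as f:
    dset = f["train_bottle"]
    for i in range(num_batches):
        dset[i*batch_size:(i+1)*batch_size, :, :, :] = prediction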

For determining the right chunk size I would also recommend:
https://stackoverflow.com/a/48405220/4045774
https://stackoverflow.com/a/44961222/4045774
