Save numpy array in append mode

The built-in .npy file format is perfectly fine for working with small datasets, and it needs no external modules other than numpy.
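To make the limitation concrete: np.save writes a whole array and overwrites the target file, so "appending" to a .npy file in practice means loading the old array, concatenating, and saving everything again. A minimal sketch of that pattern follows; the helper name append_to_npy and the file name small.npy are made up for illustration and are not part of the original answer.

```python
import os
import numpy as np

def append_to_npy(path, rows):
    """Hypothetical helper: emulate appending by rewriting the whole file."""
    rows = np.asarray(rows)
    if os.path.exists(path):
        old = np.load(path)                 # reads the full array into memory
        rows = np.concatenate([old, rows])  # joins along the first axis
    np.save(path, rows)                     # overwrites the file with the combined array
    return rows.shape

path = "small.npy"  # hypothetical file name
if os.path.exists(path):
    os.remove(path)  # start from a clean file for the demonstration

append_to_npy(path, np.zeros((2, 3)))
final_shape = append_to_npy(path, np.ones((1, 3)))
print(final_shape)  # (3, 3)
```

Each call costs time and memory proportional to the total array size, which is acceptable for small data but becomes the bottleneck as the file grows.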

However, once you start handling large amounts of data, a file format designed for such datasets, such as HDF5, is preferable [1].

For instance, below is a solution to save numpy arrays in HDF5 with PyTables:

Step 1: Create an extendable EArray storage

import tables
import numpy as np

filename = "outarray.h5"
ROW_SIZE = 100   # number of values in each row
NUM_ROWS = 200   # number of rows to append

f = tables.open_file(filename, mode="w")
atom = tables.Float64Atom()

# The 0 in the shape marks the first axis as the extendable one
array_c = f.create_earray(f.root, 'data', atom, (0, ROW_SIZE))

for idx in range(NUM_ROWS):
    x = np.random.rand(1, ROW_SIZE)  # one row, shape (1, ROW_SIZE)
    array_c.append(x)
f.close()

Step 2: Append rows to an existing dataset (if needed)

f = tables.open_file(filename, mode="a")
f.root.data.append(x)  # x must have shape (n, ROW_SIZE), as in Step 1
f.close()

Step 3: Read back a subset of the data

f = tables.open_file(filename, mode="r")
print(f.root.data[1:10,2:20]) # e.g. read from disk only this part of the dataset
f.close()
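The three steps can also be folded into one function that creates the dataset on the first call and appends on later calls. The following is only a sketch: the helper name append_rows and the file name demo_append.h5 are made up for illustration. It relies on a PyTables file object being usable as a context manager, so the file is closed even if an error occurs.

```python
import os
import numpy as np
import tables

def append_rows(filename, rows, node_name="data"):
    """Hypothetical helper: append 2-D rows to an EArray, creating it if missing."""
    rows = np.asarray(rows, dtype=np.float64)
    # mode="a" opens an existing file for read/write, or creates it if absent
    with tables.open_file(filename, mode="a") as f:
        if node_name in f.root:
            earray = f.get_node(f.root, node_name)
        else:
            # 0 in the shape marks the first axis as extendable
            earray = f.create_earray(
                f.root, node_name, tables.Float64Atom(), (0, rows.shape[1])
            )
        earray.append(rows)
        return int(earray.nrows)  # total number of rows now stored

filename = "demo_append.h5"  # hypothetical file name
if os.path.exists(filename):
    os.remove(filename)  # start from a clean file for the demonstration

append_rows(filename, np.random.rand(5, 100))
total = append_rows(filename, np.random.rand(3, 100))

with tables.open_file(filename, mode="r") as f:
    subset = f.root.data[1:4, 2:20]  # only this slice is read from disk

print(total, subset.shape)
```

Unlike rewriting a .npy file, each call here writes only the new rows, and the slice in the read step loads just the requested part of the dataset.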
