Which is faster to load in Python: pickle or HDF5?

UPDATE: nowadays I would choose between Parquet, Feather (Apache Arrow), HDF5 and Pickle.

Pros and cons:

  • Parquet
    • pros
      • one of the fastest and widely supported binary storage formats
      • supports very fast compression methods (for example Snappy codec)
      • de-facto standard storage format for Data Lakes / BigData
    • cons
      • the whole dataset must be read into memory – you can’t read a smaller subset of rows. One way around this is partitioning: split the data into multiple files and read only the required partitions.
      • no support for indexing – you can’t read a specific row or a range of rows; you always have to read the whole Parquet file
      • Parquet files are immutable – you can’t change them (no appends, updates or deletes); you can only write or overwrite a file. This “limitation” comes from the BigData world, where it would be considered one of the huge “pros”.
  • HDF5
    • pros
      • supports data slicing – ability to read a portion of the whole dataset (we can work with datasets that wouldn’t fit completely into RAM).
      • relatively fast binary storage format
      • supports compression (though the compression is slower compared to Snappy codec (Parquet) )
      • supports appending rows (mutable)
    • cons
      • single-file container format: if the file gets corrupted (e.g. by an interrupted write), the whole dataset may become unreadable
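The slicing and appending features can be sketched with pandas’ PyTables-backed HDF5 support (hypothetical file name; requires the tables package):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": range(10)})

# format="table" enables appending and queries; format="fixed" is faster
# but immutable. data_columns makes "x" usable in where= expressions.
df.to_hdf("data.h5", key="df", format="table", mode="w", data_columns=["x"])

# Append more rows to the same dataset (mutable, unlike Parquet).
df.to_hdf("data.h5", key="df", format="table", append=True)

# Read only a slice of rows -- the rest of the file stays on disk.
first_rows = pd.read_hdf("data.h5", key="df", start=0, stop=5)

# Or read only the rows matching a condition.
big_x = pd.read_hdf("data.h5", key="df", where="x > 7")
```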
  • Pickle
    • pros
      • very fast
    • cons
      • requires much more space on disk than the compressed formats
      • for long-term storage you may run into compatibility problems: newer pickle protocols can’t be read by older Python versions, and pickles of custom classes break if those classes change
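A minimal sketch of pinning the pickle protocol for cross-version compatibility (hypothetical file name):

```python
import pickle

data = {"rows": list(range(5)), "name": "example"}

# Pin an explicit protocol when older Pythons must read the file:
# protocol 4 is readable by Python 3.4+, protocol 5 only by 3.8+.
with open("data.pkl", "wb") as f:
    pickle.dump(data, f, protocol=4)

# Loading auto-detects the protocol; nothing needs to be specified here.
with open("data.pkl", "rb") as f:
    restored = pickle.load(f)
```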

OLD Answer:

I would consider only two storage formats: HDF5 (PyTables) and Feather.

Here are the results of my read/write comparison for a DataFrame (shape: 4000000 x 6, size in memory: 183.1 MB, size of the uncompressed CSV: 492 MB).

Comparison of the following storage formats: CSV, CSV.gzip, Pickle and HDF5 (with various compression settings):

                  read_s  write_s  size_ratio_to_CSV
storage
CSV               17.900    69.00              1.000
CSV.gzip          18.900   186.00              0.047
Pickle             0.173     1.77              0.374
HDF_fixed          0.196     2.03              0.435
HDF_tab            0.230     2.60              0.437
HDF_tab_zlib_c5    0.845     5.44              0.035
HDF_tab_zlib_c9    0.860     5.95              0.035
HDF_tab_bzip2_c5   2.500    36.50              0.011
HDF_tab_bzip2_c9   2.500    36.50              0.011

But your results might differ, because all my data was of the datetime dtype – it’s always better to run such a comparison with your real data, or at least with similar data…
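A small harness for running such a comparison on your own data might look like this (hypothetical file names; only Pickle and CSV shown – the other formats plug in the same way):

```python
import time

import numpy as np
import pandas as pd

# Hypothetical toy frame -- substitute your real DataFrame here.
df = pd.DataFrame({
    "a": np.random.rand(100_000),
    "b": pd.date_range("2020-01-01", periods=100_000, freq="s"),
})

def timed(fn):
    """Return how many seconds a zero-argument callable takes to run."""
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

results = {
    "pickle_write": timed(lambda: df.to_pickle("bench.pkl")),
    "pickle_read":  timed(lambda: pd.read_pickle("bench.pkl")),
    "csv_write":    timed(lambda: df.to_csv("bench.csv", index=False)),
    "csv_read":     timed(lambda: pd.read_csv("bench.csv")),
}

for name, seconds in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name:>12}: {seconds:.3f}s")
```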
