At the moment, I am using h5py to generate hdf5 datasets. I have something like this:

```python
import h5py
import numpy as np

my_data = np.genfromtxt("/tmp/data.csv", delimiter=",", dtype=None, names=True)

myFile = "/tmp/f.hdf"
with h5py.File(myFile, "a") as f:
    # vendor and dataSet are strings defined elsewhere
    dset = f.create_dataset('%s/%s' % (vendor, dataSet), data=my_data,
                            compression="gzip", compression_opts=9)
```
This works well for a relatively large ASCII file (400MB). I would like to do the same for an even larger dataset (40GB). Is there a better or more efficient way to do this with h5py? I want to avoid loading the entire data set into memory.
Some information about the data:
- I won't know the type of the data. Ideally, I would like to use `dtype=None` as in `np.genfromtxt()` above.
- I won't know the size (dimensions) of the file; it varies.
Best Answer
You could infer the dtypes of your data by reading a smaller chunk of rows at the start of the text file. Once you have these, you can create a resizable HDF5 dataset and iteratively write chunks of rows from your text file to it.
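A minimal sketch of the dtype inference (the 1,000-row sample size is an arbitrary choice, and the file path is the one from the question):

```python
import numpy as np

# Parse only a sample of rows from the top of the file so that genfromtxt can
# infer a compound dtype from the header names and the first rows of data.
sample = np.genfromtxt("/tmp/data.csv", delimiter=",", names=True,
                       dtype=None, max_rows=1000, encoding="utf-8")
row_dtype = sample.dtype
```

One caveat: if a column contains strings, the field width inferred from the sample may be too narrow for longer strings further down the file, so you may need to widen those fields by hand.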
Here's a generator that yields successive chunks of rows from a text file as numpy arrays:
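A minimal sketch of such a generator (the name `iter_chunks` and its keyword arguments are placeholders):

```python
import itertools
import numpy as np

def iter_chunks(path, chunksize, dtype, delimiter=",", skip_header=0):
    """Yield successive chunks of up to `chunksize` rows as numpy arrays."""
    with open(path) as f:
        # Skip the header line(s) so only data rows are parsed.
        for _ in range(skip_header):
            next(f)
        while True:
            lines = list(itertools.islice(f, chunksize))
            if not lines:
                break
            # genfromtxt accepts an iterable of strings, one line per row;
            # atleast_1d guards against a 0-d result when a chunk has one row.
            yield np.atleast_1d(np.genfromtxt(lines, dtype=dtype,
                                              delimiter=delimiter))
```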
Now suppose we have a `.csv` file containing a modest number of rows. We can read this data in chunks of 5 rows at a time and write the resulting arrays to a resizable dataset.
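For example, take a hypothetical file like this (the column names and values below are made up just to keep the example concrete):

```
a,b,c
0,0.5,10
1,1.5,11
2,2.5,12
3,3.5,13
4,4.5,14
5,5.5,15
6,6.5,16
7,7.5,17
8,8.5,18
9,9.5,19
10,10.5,20
11,11.5,21
```

A sketch of the chunked write, reusing the `iter_chunks` generator and the inferred dtype from above (the dataset path `vendor/dataSet` is a placeholder):

```python
import h5py
import numpy as np

csv_path = "/tmp/data.csv"

# Infer the row dtype from the header plus a sample of rows, as shown earlier.
row_dtype = np.genfromtxt(csv_path, delimiter=",", names=True,
                          dtype=None, max_rows=1000, encoding="utf-8").dtype

with h5py.File("/tmp/f.hdf", "a") as f:
    dset = None
    for chunk in iter_chunks(csv_path, chunksize=5, dtype=row_dtype,
                             skip_header=1):
        if dset is None:
            # First chunk: create a 1-D dataset that can grow along axis 0.
            dset = f.create_dataset("vendor/dataSet", data=chunk,
                                    maxshape=(None,), chunks=True,
                                    compression="gzip", compression_opts=9)
        else:
            # Later chunks: grow the dataset, then append the new rows.
            n = dset.shape[0]
            dset.resize(n + chunk.shape[0], axis=0)
            dset[n:] = chunk
    print(dset.shape, dset.dtype)
```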
Output:
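With the hypothetical 12-row file above, the final `print` would show roughly the following (integer field widths depend on the platform):

```
(12,) [('a', '<i8'), ('b', '<f8'), ('c', '<i8')]
```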
For your dataset you will probably want to use a larger chunksize.