I routinely use tens of gigabytes of data in just this fashion, e.g. I have tables on disk that I read via queries, create data from, and append back to.

It's worth reading the docs and later in this thread for several suggestions on how to store your data.

Give as much detail as you can, and I can help you develop a structure. Details that will affect how you store your data include:
- Size of data, # of rows, columns, types of columns; are you appending rows, or just columns?
- What will typical operations look like? E.g. do a query on columns to select a bunch of rows and specific columns, then do an operation (in-memory), create new columns, and save these.
(Giving a toy example could enable us to offer more specific recommendations.)
- After that processing, then what do you do? Is step 2 ad hoc, or repeatable?
- Input flat files: how many, and what is their rough total size in GB? How are these organized, e.g. by records? Does each one contain different fields, or do they have some records per file with all of the fields in each file?
- Do you ever select subsets of rows (records) based on criteria (e.g. select the rows with field A > 5) and then do something, or do you just select fields A, B, C with all of the records (and then do something)?
- Do you 'work on' all of your columns (in groups), or is there a good proportion that you may only use for reports (e.g. you want to keep the data around, but don't need to pull in that column explicitly until final results time)?
Solution
Ensure you have at least pandas 0.10.1 installed.
Read about iterating over files chunk-by-chunk and multiple table queries.
Since pytables is optimized to operate row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with one big table too, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)

```python
import numpy as np
import pandas as pd

# create a store
store = pd.HDFStore('mystore.h5')

# this is the key to your storage:
# this maps your fields to a specific group, and defines
# what you want to have as data_columns.
# you might want to create a nice class wrapping this
# (as you will want to have this map and its inversion)
group_map = dict(
    A=dict(fields=['field_1', 'field_2', .....], dc=['field_1', ...., 'field_5']),
    B=dict(fields=['field_10', ......], dc=['field_10']),
    .....
    REPORTING_ONLY=dict(fields=['field_1000', 'field_1001', ...], dc=[]),
)

group_map_inverted = dict()
for g, v in group_map.items():
    group_map_inverted.update(dict([(f, g) for f in v['fields']]))
```
Reading in the files and creating the storage (essentially doing what `append_to_multiple` does):
```python
for f in files:
    # read in the file, additional options may be necessary here
    # the chunksize is not strictly necessary, you may be able to slurp each
    # file into memory in which case just eliminate this part of the loop
    # (you can also change chunksize if necessary)
    for chunk in pd.read_table(f, chunksize=50000):
        # we are going to append to each table by group
        # we are not going to create indexes at this time
        # but we *ARE* going to create (some) data_columns

        # figure out the field groupings
        for g, v in group_map.items():
            # create the frame for this group
            frame = chunk.reindex(columns=v['fields'], copy=False)

            # append it
            store.append(g, frame, index=False, data_columns=v['dc'])
```
Now you have all of the tables in the file (actually you could store them in separate files if you wish; you would probably have to add the filename to the group_map, but that probably isn't necessary).
This is how you get columns and create new ones:
```python
frame = store.select(group_that_I_want)
# you can optionally specify:
# columns = a list of the columns IN THAT GROUP (if you wanted to
# select only say 3 out of the 20 columns in this sub-table)
# and a where clause if you want a subset of the rows

# do calculations on this frame
new_frame = cool_function_on_frame(frame)

# to 'add columns', create a new group (you probably want to
# limit the columns in this new_group to be only NEW ones
# (e.g. so you don't overlap from the other tables)
# add this info to the group_map
store.append(new_group,
             new_frame.reindex(columns=new_columns_created, copy=False),
             data_columns=new_columns_created)
```
When you are ready for post_processing:
```python
# This may be a bit tricky, and depends what you are actually doing.
# I may need to modify this function to be a bit more general:
report_data = store.select_as_multiple([group_1, group_2, .....],
                                       where=['field_1>0', 'field_1000=foo'],
                                       selector=group_1)
```
About data_columns: you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:

```python
store.select(group, where=['field_1000=foo', 'field_1001>0'])
```
They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).
You also might want to:
- create a function which takes a list of fields, looks up the groups in the group_map, then selects these and concatenates the results so you get the resulting frame (this is essentially what select_as_multiple does). This way the structure would be pretty transparent to you.
- create indexes on certain data columns (makes row-subsetting much faster).
- enable compression.
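The first suggestion above can be sketched as a small helper. This is a hypothetical sketch (the name `select_fields` is my own); it assumes a `group_map` shaped like the one defined earlier, and any store with an HDFStore-style `select(key, columns=..., where=...)` method:

```python
import pandas as pd

def select_fields(store, group_map, fields, where=None):
    """Select `fields` across the per-group tables and concatenate them.

    Hypothetical helper: essentially what select_as_multiple does,
    but it keeps the group structure transparent to the caller.
    """
    # invert the map: field -> group it was stored under
    inverted = {f: g for g, v in group_map.items() for f in v['fields']}

    # bucket the requested fields by the group they belong to
    by_group = {}
    for f in fields:
        by_group.setdefault(inverted[f], []).append(f)

    # pull only the needed columns from each group's table,
    # then glue the frames together column-wise
    pieces = [store.select(g, columns=cols, where=where)
              for g, cols in by_group.items()]
    return pd.concat(pieces, axis=1)
```

Note this assumes the row order (and index) is consistent across the group tables, which holds when they were all appended chunk-by-chunk from the same source files.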
Let me know when you have questions!
Best Answer
HDF5 Advantages: Organization, flexibility, interoperability
Some of the main advantages of HDF5 are its hierarchical structure (similar to folders/files), optional arbitrary metadata stored with each item, and its flexibility (e.g. compression). This organizational structure and metadata storage may sound trivial, but it's very useful in practice.
Another advantage of HDF is that the datasets can be either fixed-size or flexibly sized. Therefore, it's easy to append data to a large dataset without having to create an entire new copy.
Additionally, HDF5 is a standardized format with libraries available for almost any language, so sharing your on-disk data between, say, Matlab, Fortran, R, C, and Python is very easy with HDF. (To be fair, it's not too hard with a big binary array either, as long as you're aware of the C vs. F ordering and know the shape, dtype, etc. of the stored array.)
HDF advantages for a large array: Faster I/O of an arbitrary slice
Just as the TL;DR: for an ~8GB 3D array, reading a "full" slice along any axis took ~20 seconds with a chunked HDF5 dataset, and from 0.3 seconds (best case) to over three hours (worst case) for a memmapped array of the same data.
Beyond the things listed above, there's another big advantage to a "chunked"* on-disk data format such as HDF5: Reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average.
*(HDF5 doesn't have to be a chunked data format. It supports chunking, but doesn't require it. In fact, the default for creating a dataset in `h5py` is not to chunk, if I recall correctly.)

Basically, your best-case disk-read speed and your worst-case disk-read speed for a given slice of your dataset will be fairly close with a chunked HDF dataset (assuming you chose a reasonable chunk size or let a library choose one for you). With a simple binary array, the best case is faster, but the worst case is much worse.
One caveat: if you have an SSD, you likely won't notice a huge difference in read/write speed. With a regular hard drive, though, sequential reads are much, much faster than random reads (i.e. a regular hard drive has a long seek time). HDF still has an advantage on an SSD, but it's more due to its other features (e.g. metadata, organization, etc.) than to raw speed.

First off, to clear up confusion: accessing an `h5py` dataset returns an object that behaves fairly similarly to a numpy array, but does not load the data into memory until it's sliced. (Similar to memmap, but not identical.) Have a look at the `h5py` introduction for more information.

Slicing the dataset will load a subset of the data into memory, but presumably you want to do something with it, at which point you'll need it in memory anyway.
If you do want to do out-of-core computations, you can do so fairly easily for tabular data with `pandas` or `pytables`. It is possible with `h5py` (nicer for big N-D arrays), but you need to drop down to a touch lower level and handle the iteration yourself.

However, the future of numpy-like out-of-core computations is Blaze. Have a look at it if you really want to take that route.
The "unchunked" case
First off, consider a 3D C-ordered array written to disk. (I'll simulate it by calling `arr.ravel()` and printing the result, to make things more visible.) The values would be stored on disk sequentially, in exactly that flattened order. (Let's ignore filesystem details and fragmentation for the moment.)
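The inline example didn't survive here; this is a sketch of what it likely showed, using a small 4x6x6 array of my own choosing, so that each slice along the first axis holds exactly 6*6 = 36 values:

```python
import numpy as np

# a toy stand-in: a small C-ordered 3D array
arr = np.arange(4 * 6 * 6).reshape(4, 6, 6)

# ravel() shows the order the values occupy on disk for a C-ordered array
flat = arr.ravel()

# a slice along the first axis is exactly the first 36 on-disk values...
assert np.array_equal(arr[0].ravel(), flat[:36])

# ...while a slice along the last axis is scattered: every 6th value
assert np.array_equal(arr[:, :, 0].ravel(), flat[::6])
```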
In the best-case scenario, let's take a slice along the first axis. Notice that these are just the first 36 values of the array. This will be a very fast read! (One seek, one read.)
Similarly, the next slice along the first axis will just be the next 36 values. To read a complete slice along this axis, we only need one seek operation. If all we're going to be reading is various slices along this axis, then this is the perfect file structure.

However, let's consider the worst-case scenario: a slice along the last axis.
To read this slice in, we need 36 seeks and 36 reads, as all of the values are separated on disk. None of them are adjacent!
This may seem pretty minor, but as we get to larger and larger arrays, the number and size of the seek operations grows rapidly. For a large-ish (~10GB) 3D array stored in this way and read in via memmap, reading a full slice along the "worst" axis can easily take tens of minutes, even with modern hardware. At the same time, a slice along the best axis can take less than a second. For simplicity, I'm only showing "full" slices along a single axis, but the exact same thing happens with arbitrary slices of any subset of the data.

Incidentally, there are several file formats that take advantage of this and basically store three copies of huge 3D arrays on disk: one in C-order, one in F-order, and one in the intermediate between the two. (An example of this is Geoprobe's D3D format, though I'm not sure it's documented anywhere.) Who cares if the final file size is 4TB; storage is cheap! The crazy thing is that because the main use case is extracting a single sub-slice in each direction, the reads you want to make are very, very fast. It works very well!
The simple "chunked" case
Let's say we store 2x2x2 "chunks" of the 3D array as contiguous blocks on disk. In other words, something like:
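The code building the chunked layout wasn't preserved; this is a sketch of the idea, reusing a 4x6x6 toy array (the variable names are my own):

```python
import numpy as np

arr = np.arange(4 * 6 * 6).reshape(4, 6, 6)

# lay the array out so each 2x2x2 block is contiguous
nx, ny, nz = arr.shape
blocks = []
for i in range(0, nx, 2):
    for j in range(0, ny, 2):
        for k in range(0, nz, 2):
            blocks.append(arr[i:i + 2, j:j + 2, k:k + 2].ravel())
chunked = np.hstack(blocks)

# the first 8 values of `chunked` are the first 2x2x2 block of `arr`
assert np.array_equal(chunked[:8], arr[:2, :2, :2].ravel())
```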
So the data on disk would look like `chunked`. And just to show that these really are 2x2x2 blocks of `arr`, notice that the first 8 values of `chunked` are exactly the first 2x2x2 block of `arr`.

To read in any slice along an axis, we'd read in either 6 or 9 contiguous chunks (twice as much data as we need) and then only keep the portion we wanted. That's a worst-case maximum of 9 seeks vs. a maximum of 36 seeks for the non-chunked version. (But the best case is still 6 seeks vs. 1 for the memmapped array.) Because sequential reads are very fast compared to seeks, this significantly reduces the amount of time it takes to read an arbitrary subset into memory. Once again, this effect becomes larger with larger arrays.
HDF5 takes this a few steps farther. The chunks don't have to be stored contiguously, and they're indexed by a B-Tree. Furthermore, they don't have to be the same size on disk, so compression can be applied to each chunk.
Chunked arrays with `h5py`

By default, `h5py` doesn't create chunked HDF files on disk (I think `pytables` does, by contrast). If you specify `chunks=True` when creating the dataset, however, you'll get a chunked array on disk.

As a quick, minimal example:
Note that `chunks=True` tells `h5py` to automatically pick a chunk size for us. If you know more about your most common use case, you can optimize the chunk size/shape by specifying a shape tuple (e.g. `(2,2,2)` in the simple example above). This allows you to make reads along a particular axis more efficient, or to optimize for reads/writes of a certain size.

I/O performance comparison
Just to emphasize the point, let's compare reading in slices from a chunked HDF5 dataset and a large (~8GB), Fortran-ordered 3D array containing the same exact data.
I've cleared all OS caches between each run, so we're seeing the "cold" performance.
For each file type, we'll test reading in a "full" x-slice along the first axis and a "full" z-slice along the last axis. For the Fortran-ordered memmapped array, the "x" slice is the worst case, and the "z" slice is the best case.
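The slice tests just described could be scripted roughly like this (a sketch only; the file path, dataset name, and slice indices are placeholders, not the ones from the original scripts):

```python
import sys
import h5py

def read():
    # placeholder path and dataset name
    f = h5py.File('/tmp/test.hdf5', 'r')
    return f['data']

def x_slice(data):
    # a "full" slice along the first axis
    return data[1, :, :]

def z_slice(data):
    # a "full" slice along the last axis
    return data[:, :, 1]

if __name__ == '__main__' and len(sys.argv) > 1:
    data = read()
    if sys.argv[1] == 'x':
        x_slice(data)
    elif sys.argv[1] == 'z':
        z_slice(data)
```

Timing is then just e.g. `time python the_script.py z`, after clearing OS caches.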
The code used is in a gist (including creating the `hdf` file). I can't easily share the data used here, but you could simulate it with an array of zeros of the same shape `(621, 4991, 2600)` and type `np.uint8`.

The `chunked_hdf.py` script simply opens the dataset and reads one slice. `memmapped_array.py` is similar, but has a touch more complexity to ensure the slices are actually loaded into memory (by default, another memmapped array would be returned, which wouldn't be an apples-to-apples comparison).

Let's have a look at the HDF performance first:
A "full" x-slice and a "full" z-slice take about the same amount of time (~20 sec). Considering this is an 8GB array, that's not too bad.
And if we compare this to the memmapped array times (it's Fortran-ordered, so a "z-slice" is the best case and an "x-slice" is the worst case):
Yes, you read that right. 0.3 seconds for one slice direction and ~3.5 hours for the other.
The time to slice in the "x" direction is far longer than the amount of time it would take to load the entire 8GB array into memory and select the slice we wanted! (Again, this is a Fortran-ordered array. The opposite x/z slice timing would be the case for a C-ordered array.)
However, if we're always wanting to take a slice along the best-case direction, the big binary array on disk is very good. (~0.3 sec!)
With a memmapped array, you're stuck with this I/O discrepancy (or perhaps anisotropy is a better term). However, with a chunked HDF dataset, you can choose the chunksize such that access is either equal or is optimized for a particular use-case. It gives you a lot more flexibility.
In summary
Hopefully that helps clear up one part of your question, at any rate. HDF5 has many other advantages over "raw" memmaps, but I don't have room to expand on all of them here. Compression can speed some things up (the data I work with doesn't benefit much from compression, so I rarely use it), and OS-level caching often plays more nicely with HDF5 files than with "raw" memmaps. Beyond that, HDF5 is a really fantastic container format. It gives you a lot of flexibility in managing your data, and can be used from more or less any programming language.
Overall, try it and see if it works well for your use case. I think you might be surprised.