C++ – How to Make a Random-Access Archive of Text Files

c · design · file handling

I wrote an application that tests the performance of evolutionary algorithms. The application performs a run of an algorithm, which consists of several generations. The data produced by my application looks like this:

run1.run             // text file containing metadata
run1_data            // folder containing experimental data
   -statistics1      // text file containing some specific statistics
   -statistics2     
   -generation0     
   -generation10     // snapshot of the algorithm at generation10
   -generation20
   ...

Once this data is written, it is never changed.

When I want to examine the data, my application reads the metadata file (.run), opens the _data directory and reads the rest of the data.

This was all fine until recently. I now have hundreds of thousands of these files: I have run out of inodes on my system, and loading or copying the data is extremely slow even though it only amounts to a few gigabytes. The data seems to be too fragmented, since the individual files are quite small.

My application is written in C++ and uses the Qt library for filesystem operations. I was thinking of using the system() function from <cstdlib> to issue a tar command to archive the data after writing and extract it before reading, but I found out that tar must read through the entire archive to find its contents. This is a problem for me, since to save memory and time I sometimes load only statistics1, and sometimes only generation10.
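For illustration, this is roughly what I had in mind (run1 is just the example name from above):

    #include <cstdlib>   // std::system lives here
    #include <string>

    // Pack one run's metadata file and data folder into a single tar archive.
    void archiveRun(const std::string& runName)
    {
        std::string cmd = "tar -cf " + runName + ".tar "
                        + runName + ".run " + runName + "_data";
        std::system(cmd.c_str());
    }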

I was considering changing the format of my data so that there would be only one file, which would start with something like a table of contents, followed by all the data files concatenated. The table of contents would indicate the beginning and end of each concatenated file. However, I am not sure whether this is a good solution, because I do not know whether the std::ifstream class that I use to read the files can make random jumps.
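Here is a rough sketch of what I mean (assuming std::ifstream::seekg() can be used for the jumps; Entry and loadEntry are just names I made up):

    #include <cstdint>
    #include <fstream>
    #include <string>

    struct Entry {
        std::string name;       // e.g. "statistics1" or "generation10"
        std::uint64_t offset;   // where the file starts inside the archive
        std::uint64_t size;     // how many bytes it occupies
    };

    // Read one concatenated file out of the archive without touching the rest.
    std::string loadEntry(const std::string& archivePath, const Entry& e)
    {
        std::ifstream in(archivePath, std::ios::binary);
        in.seekg(static_cast<std::streamoff>(e.offset));          // jump to the entry
        std::string data(e.size, '\0');
        in.read(&data[0], static_cast<std::streamsize>(e.size));  // read only that entry
        return data;
    }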

I am a beginner programmer and I do not want to waste a lot of time developing something that will not work, so I would appreciate any advice on how to solve my problem.

Best Answer

You could consider using an indexed file library like gdbm (or something similar).
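A rough sketch of how one of your small data files could be stored with gdbm's C API (error handling omitted; the key names are only illustrative):

    #include <gdbm.h>
    #include <string>

    // Store one data file under its name, e.g. "generation10".
    void storeBlob(GDBM_FILE db, const std::string& name, const std::string& content)
    {
        datum key{const_cast<char*>(name.data()),    static_cast<int>(name.size())};
        datum val{const_cast<char*>(content.data()), static_cast<int>(content.size())};
        gdbm_store(db, key, val, GDBM_REPLACE);   // insert or overwrite
    }

    // Usage (sketch):
    // GDBM_FILE db = gdbm_open("run1.gdbm", 0, GDBM_WRCREAT, 0644, nullptr);
    // storeBlob(db, "generation10", snapshotText);
    // gdbm_close(db);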

You could also perhaps consider using sqlite (it is a bit of an overkill, but learning some basic SQL skills is useful!), or even a real database system (e.g. postgresql or mongodb). Don't forget to back up the database by dumping it in textual (i.e. SQL) format.
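If you go the sqlite route, a minimal sketch could look like this (error handling omitted; the table layout is only a suggestion):

    #include <sqlite3.h>
    #include <string>

    // Store one data file as a blob, keyed by (run, name).
    void storeEntry(sqlite3* db, const std::string& run,
                    const std::string& name, const std::string& content)
    {
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS entries("
            "run TEXT, name TEXT, content BLOB, PRIMARY KEY(run, name));",
            nullptr, nullptr, nullptr);

        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db, "INSERT OR REPLACE INTO entries VALUES(?,?,?);",
                           -1, &stmt, nullptr);
        sqlite3_bind_text(stmt, 1, run.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_text(stmt, 2, name.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_bind_blob(stmt, 3, content.data(),
                          static_cast<int>(content.size()), SQLITE_TRANSIENT);
        sqlite3_step(stmt);
        sqlite3_finalize(stmt);
    }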

You might also be interested in textual serialization formats like JSON (there are libraries for them, e.g. jsoncpp, and textual data is nice to handle). You could put JSON data inside GDBM or sqlite containers (see this example of mine).
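For example, with jsoncpp a generation snapshot could be written out roughly like this (the field names are purely illustrative):

    #include <json/json.h>
    #include <fstream>
    #include <string>

    // Write a snapshot of the algorithm's state as a small JSON text file.
    void writeSnapshot(const std::string& path, int generation, double bestFitness)
    {
        Json::Value root;
        root["generation"]  = generation;
        root["bestFitness"] = bestFitness;

        Json::StreamWriterBuilder builder;
        std::ofstream out(path);
        out << Json::writeString(builder, root);   // serialize to text
    }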

BTW, if you want to keep having many files, organizing them in subdirectories (e.g. dir01/data0020 ....) might help, as sketched below.
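Since you already use Qt, such a bucketing scheme could look roughly like this (the bucket-by-hundreds layout is only an example):

    #include <QDir>
    #include <QString>

    // Put each run's data folder into a bucket directory such as dir00, dir01, ...
    QString runDirectory(int runId)
    {
        QString bucket = QString("dir%1").arg(runId / 100, 2, 10, QChar('0'));
        QDir().mkpath(bucket);                  // create the bucket if it does not exist yet
        return bucket + QString("/run%1_data").arg(runId);
    }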

You would probably want to write a small helper application to browse or access your data. Think also about backing up your data in a textual (not binary) format!

There are some libraries that handle the tar format, such as libtar, but I guess you should not use them here.

Also check whether your field (evolutionary algorithms) has already defined some conventions or file formats. Document your own format (even if only for yourself!).
