Design Pattern for Splitting Files into Smaller Files

c

I am developing a project where I have to load very large files (up to 50 MB). Currently I load each file completely into one contiguous block of memory. This has the advantage that I can very easily change bytes at specific locations, which matters because I do not know the structure of every byte.

However, I also intend to change the structure, e.g. by removing or adding 'chunks'. My idea is to extract the 'known' parts, store each of them in a class whose data chunk contains only that part, and keep a sort of reference list to those chunks.

E.g.:

Original file:

  • Header
  • ChunkA 1
  • ChunkA 2
  • Intermediate
  • ChunkB 1
  • Footer

The result will be:

  • A ChunkA 1 instance and a ChunkA 2 instance
  • A ChunkB 1 instance
  • A 'File' instance holding a reference list with the base offsets of, and references to, all chunks

At the end I have to 'recreate' or write the original file (with changes) back.

Is this a good idea in general, or is there a design pattern that would help me here?

Best Answer

First, I agree with the sentiment of some of the commenters above that what you need may simply be sensible data structures. Each data structure can fulfill a simple contract: read itself in from a byte stream and write itself out to a byte stream. For files of up to 50 MB, that may be all you need. Keep that in mind while reading the rest of the answer.
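
As a rough illustration of that read/write contract, here is a minimal Python sketch. Everything about the format is an assumption made up for the example: a "chunk" is taken to be a 4-byte tag, a 4-byte little-endian length, and a payload, and the names RawBytes, Chunk, FileModel, build_example and parse_example are hypothetical. Bytes of unknown structure (header, intermediate, footer) are simply kept verbatim.

```python
import io
import struct
from dataclasses import dataclass
from typing import BinaryIO


@dataclass
class RawBytes:
    """Bytes whose structure is unknown; carried through unchanged."""
    data: bytes

    @classmethod
    def read_from(cls, stream: BinaryIO, length: int) -> "RawBytes":
        return cls(stream.read(length))

    def write_to(self, stream: BinaryIO) -> None:
        stream.write(self.data)


@dataclass
class Chunk:
    """Hypothetical known chunk: 4-byte tag, uint32 LE length, payload."""
    tag: bytes
    payload: bytes

    @classmethod
    def read_from(cls, stream: BinaryIO) -> "Chunk":
        tag = stream.read(4)
        (length,) = struct.unpack("<I", stream.read(4))
        return cls(tag, stream.read(length))

    def write_to(self, stream: BinaryIO) -> None:
        # The length field is recomputed, so edited payloads stay consistent.
        stream.write(self.tag)
        stream.write(struct.pack("<I", len(self.payload)))
        stream.write(self.payload)


@dataclass
class FileModel:
    """Ordered list of segments; writing them in order recreates the file."""
    segments: list

    def to_bytes(self) -> bytes:
        out = io.BytesIO()
        for segment in self.segments:
            segment.write_to(out)
        return out.getvalue()


def build_example() -> bytes:
    """Build a tiny file in the assumed layout from the question."""
    out = io.BytesIO()
    out.write(b"HEADER")
    Chunk(b"CHKA", b"one").write_to(out)
    Chunk(b"CHKA", b"two").write_to(out)
    out.write(b"MID!")
    Chunk(b"CHKB", b"three").write_to(out)
    out.write(b"FOOTER")
    return out.getvalue()


def parse_example(data: bytes) -> FileModel:
    """Parse the assumed layout: 6-byte header, 2 chunks, 4 bytes, 1 chunk, footer."""
    stream = io.BytesIO(data)
    segments = [
        RawBytes.read_from(stream, 6),
        Chunk.read_from(stream),
        Chunk.read_from(stream),
        RawBytes.read_from(stream, 4),
        Chunk.read_from(stream),
    ]
    segments.append(RawBytes(stream.read()))  # whatever remains is the footer
    return FileModel(segments)
```

With this shape, removing a chunk is just deleting an entry from `segments`, and no absolute offsets need to be stored because they fall out of the write order. If parts of the unknown data contain offsets that point into the file, those would still need to be fixed up separately.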

However, I feel that you may be getting at some deeper concepts here.

The first that comes to mind is buffer efficiency. I believe a common trick here is to preallocate buffer "parts" of a known size and work with lists of those parts. In C#, IList<ArraySegment<byte>> comes to mind as an efficient wrapper around presumably preallocated arrays. Note that these buffer sizes are often chosen to match the disk sector size and the memory page size. Defining the structure efficiently up front can allow for interesting optimizations later. For example, the TAR archive format uses a 512-byte header record for this sort of reason. If you're copying a file out of a TAR, your sector boundaries don't get messed up, which can be very nice.
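
To make the "list of fixed-size parts" idea concrete, here is a small sketch in Python rather than C#. The class name SegmentedBuffer and the 4096-byte part size are assumptions chosen for illustration (4096 is a common page size, but check your platform). A logical offset is mapped to a part index and an offset within that part using divmod.

```python
import io
from typing import BinaryIO, List

PART_SIZE = 4096  # assumption: a common memory page size


class SegmentedBuffer:
    """File contents held as a list of fixed-size, preallocated parts."""

    def __init__(self, part_size: int = PART_SIZE) -> None:
        self.part_size = part_size
        self.parts: List[bytearray] = []
        self.length = 0  # number of valid bytes across all parts

    def load(self, stream: BinaryIO) -> None:
        while True:
            part = bytearray(self.part_size)  # preallocated, fixed size
            view = memoryview(part)
            filled = 0
            # Keep reading until the part is full or the stream ends.
            while filled < self.part_size:
                n = stream.readinto(view[filled:])
                if not n:
                    break
                filled += n
            if filled == 0:
                break
            self.parts.append(part)
            self.length += filled
            if filled < self.part_size:
                break  # short part means end of stream

    def _locate(self, offset: int):
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        return divmod(offset, self.part_size)

    def __getitem__(self, offset: int) -> int:
        index, within = self._locate(offset)
        return self.parts[index][within]

    def __setitem__(self, offset: int, value: int) -> None:
        index, within = self._locate(offset)
        self.parts[index][within] = value

    def write_to(self, stream: BinaryIO) -> None:
        remaining = self.length
        for part in self.parts:
            n = min(remaining, self.part_size)
            stream.write(memoryview(part)[:n])  # only the valid bytes
            remaining -= n
```

This keeps in-place byte edits as cheap as with one contiguous array, while avoiding a single 50 MB allocation. It does not by itself make insertions or deletions cheap, since those still shift everything after the edit point.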

Second, I wonder whether studying the design of the rope data structure, which is used for string handling, might yield some insight. It follows a similar line of thought. How useful it is will depend on your editing strategy.
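
For a feel of how a rope handles inserting and removing chunks, here is a deliberately minimal, unbalanced sketch over bytes. The names Leaf, Concat, concat, split, insert, delete and to_bytes are all hypothetical. It also departs from the textbook rope in one detail: each internal node caches its total length instead of the weight of its left subtree. A real implementation would rebalance the tree and would probably slice with memoryview to avoid copying leaf data.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    data: bytes

    def __len__(self) -> int:
        return len(self.data)


@dataclass(frozen=True)
class Concat:
    left: "Rope"
    right: "Rope"
    length: int  # cached total length of both subtrees

    def __len__(self) -> int:
        return self.length


Rope = Union[Leaf, Concat]


def concat(left: Rope, right: Rope) -> Rope:
    """Join two ropes without copying their contents."""
    if len(left) == 0:
        return right
    if len(right) == 0:
        return left
    return Concat(left, right, len(left) + len(right))


def split(rope: Rope, index: int) -> Tuple[Rope, Rope]:
    """Return (first `index` bytes, remaining bytes) as two ropes."""
    if isinstance(rope, Leaf):
        return Leaf(rope.data[:index]), Leaf(rope.data[index:])
    left_len = len(rope.left)
    if index < left_len:
        head, tail = split(rope.left, index)
        return head, concat(tail, rope.right)
    head, tail = split(rope.right, index - left_len)
    return concat(rope.left, head), tail


def insert(rope: Rope, index: int, data: bytes) -> Rope:
    """Insert `data` so that it starts at offset `index`."""
    left, right = split(rope, index)
    return concat(concat(left, Leaf(data)), right)


def delete(rope: Rope, start: int, end: int) -> Rope:
    """Remove the bytes in the half-open range [start, end)."""
    left, rest = split(rope, start)
    _, right = split(rest, end - start)
    return concat(left, right)


def to_bytes(rope: Rope) -> bytes:
    """Flatten the rope with an iterative left-to-right traversal."""
    parts = []
    stack = [rope]
    while stack:
        node = stack.pop()
        if isinstance(node, Leaf):
            parts.append(node.data)
        else:
            stack.append(node.right)  # pushed first, visited last
            stack.append(node.left)
    return b"".join(parts)
```

The point of the structure is that an insert or delete only touches the nodes along one path in the tree, so the large untouched regions of the file are never moved. The file is flattened once, at the end, when it is written back.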