Design Pattern for Splitting Files into Smaller Files

I am developing a project where I have to load very large files (up to 50 MB). Currently I load each file completely into one contiguous block of memory. The advantage is that I can very easily change bytes at specific offsets, even though I do not know the structure of every part of the file.

However, I also intend to change the structure, e.g. by removing or adding 'chunks'. My idea is to extract the 'known' parts from the file, store each one in a class instance that holds only that part's data, and keep a list of references to those chunks.

E.g.:

Original file:

Header
ChunkA 1
ChunkA 2
Intermediate
ChunkB 1
Footer

The result will be:

A ChunkA 1 instance and a ChunkA 2 instance.
A ChunkB 1 instance.

A 'File' instance with a reference list that holds the base offset of each chunk and a reference to it.

At the end I have to 'recreate' the original file (with changes) and write it back.

Is this a good idea in general, or is there a design pattern that would help me here?

Best Answer

First, I agree with the sentiment of some of the commenters above that what you need may simply be sensible data structures. Each data structure can fulfill a simple contract: read itself in from a byte stream and write itself out to a byte stream. For files of around 50 MB, that may be all you need. Keep that in mind while reading the rest of this answer.
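To make the read/write contract concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the question does not describe the real file format, so the `ChunkA` layout (a 4-byte tag, a little-endian length, then the payload), the 6-byte header, and the names `RawBlock`, `ChunkA`, and `write_file` are all hypothetical.

```python
import io
import struct
from dataclasses import dataclass
from typing import ClassVar


@dataclass
class RawBlock:
    """Bytes whose structure is unknown; kept verbatim."""
    data: bytes

    def write_to(self, stream):
        stream.write(self.data)


@dataclass
class ChunkA:
    """Hypothetical known chunk: tag b'CHKA', uint32 length, payload."""
    payload: bytes
    TAG: ClassVar[bytes] = b"CHKA"

    @classmethod
    def read_from(cls, stream):
        tag = stream.read(4)
        if tag != cls.TAG:
            raise ValueError(f"unexpected tag {tag!r}")
        (length,) = struct.unpack("<I", stream.read(4))
        return cls(stream.read(length))

    def write_to(self, stream):
        # The length field is recomputed, so the payload may grow or shrink.
        stream.write(self.TAG)
        stream.write(struct.pack("<I", len(self.payload)))
        stream.write(self.payload)


def write_file(parts, stream):
    """Recreate the file by writing every part in order."""
    for part in parts:
        part.write_to(stream)


# Usage: split a made-up file, resize the chunk, write everything back.
original = b"HEADER" + b"CHKA" + struct.pack("<I", 3) + b"abc" + b"FOOTER"
src = io.BytesIO(original)
parts = [RawBlock(src.read(6)), ChunkA.read_from(src), RawBlock(src.read())]
parts[1].payload = b"abcdef"
out = io.BytesIO()
write_file(parts, out)
```

Because each part writes itself, no base offsets need to be stored in this sketch; they fall out of the order of the list when the file is written.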

However, I feel that you may be getting at some deeper concepts here.

The first that comes to mind is efficiency with buffers. I believe a common trick here is to preallocate buffer "parts" of a known size and work with lists of those parts. In C#, IList<ArraySegment<byte>> comes to mind as an efficient wrapper around presumably preallocated arrays. Note that these buffer sizes are often chosen to match the disk sector size and the memory page size. Defining the structure efficiently up front can allow for interesting optimizations later. For example, the TAR archive format uses a 512-byte header record for this sort of reason: if you copy a file out of a TAR archive, your sector boundaries don't get messed up, which can be very nice.
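The idea of a list of fixed-size buffer parts can be sketched as follows. This is an illustration in Python, not the C# API mentioned above: the class name `SegmentedBuffer` and the 4096-byte default part size are assumptions, and the sketch assumes the stream returns full parts until the end of the data (true for `io.BytesIO` and buffered files).

```python
import io


class SegmentedBuffer:
    """File contents held as a list of fixed-size, preallocated parts."""

    def __init__(self, part_size=4096):
        self.part_size = part_size
        self.parts = []      # list of bytearray, each part_size long
        self.length = 0      # number of valid bytes overall

    @classmethod
    def from_stream(cls, stream, part_size=4096):
        buf = cls(part_size)
        while True:
            data = stream.read(part_size)
            if not data:
                break
            part = bytearray(part_size)   # preallocated, zero-filled
            part[:len(data)] = data
            buf.parts.append(part)
            buf.length += len(data)
        return buf

    def _locate(self, offset):
        if not 0 <= offset < self.length:
            raise IndexError(offset)
        # (index of the part, index inside that part)
        return divmod(offset, self.part_size)

    def __getitem__(self, offset):
        i, j = self._locate(offset)
        return self.parts[i][j]

    def __setitem__(self, offset, value):
        i, j = self._locate(offset)
        self.parts[i][j] = value

    def write_to(self, stream):
        remaining = self.length
        for part in self.parts:
            n = min(remaining, self.part_size)
            stream.write(memoryview(part)[:n])   # no intermediate copy
            remaining -= n


# Usage: patch one byte in place, then write the data back out.
buf = SegmentedBuffer.from_stream(io.BytesIO(bytes(range(10))), part_size=4)
buf[5] = 0xFF
patched = io.BytesIO()
buf.write_to(patched)
```

Patching a byte by offset stays as cheap as with one contiguous array, while no single allocation ever has to be larger than one part.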

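Second, I have to wonder if a study of the design behind the rope data structure for string handling might yield some insight. A rope stores a long string as a tree of smaller pieces, so inserting or deleting text rearranges pieces instead of shifting every byte that follows. It follows a similar line of thought to your chunk list, and how useful it is would depend on your editing strategy.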
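As a rough illustration of that line of thought, the sketch below uses a flat list of pieces rather than a real rope. A rope keeps its pieces in a balanced tree so that lookups stay fast, which this simplification does not attempt. The class name `PieceList` and its methods are hypothetical.

```python
class PieceList:
    """Byte sequence stored as a list of slices; edits never shift bytes."""

    def __init__(self, original: bytes):
        # memoryview slices refer to the original buffer without copying it.
        self.pieces = [memoryview(original)] if original else []

    def _split_at(self, offset):
        """Ensure a piece boundary at offset; return the index after it."""
        if offset < 0:
            raise IndexError(offset)
        pos = 0
        for i, piece in enumerate(self.pieces):
            if offset == pos:
                return i
            if offset < pos + len(piece):
                cut = offset - pos
                self.pieces[i:i + 1] = [piece[:cut], piece[cut:]]
                return i + 1
            pos += len(piece)
        if offset == pos:
            return len(self.pieces)
        raise IndexError(offset)

    def insert(self, offset, data: bytes):
        i = self._split_at(offset)
        self.pieces.insert(i, memoryview(data))

    def delete(self, offset, length):
        start = self._split_at(offset)
        end = self._split_at(offset + length)
        del self.pieces[start:end]

    def to_bytes(self):
        # The only place where the bytes are actually copied.
        return b"".join(self.pieces)


# Usage: replace one chunk with another without moving the footer.
doc = PieceList(b"HeaderChunkAFooter")
doc.delete(6, 6)
doc.insert(6, b"ChunkB")
result = doc.to_bytes()
```

Removing or adding a chunk only changes the list of pieces; the data is copied once, when the file is written back.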

Related Solutions

C# Text Processing – How to Load Text Files into Memory-Mapped Files

I do not think you will gain much, if anything, in performance by using memory-mapped files instead of normal text-file processing. As soon as you change the length of a single line, even by just one byte, the remainder of the file has to be read, shifted by one byte, and written back to disk. From an I/O point of view, this is equivalent to normal text-file processing: read a line, modify it, write it, repeat. And the headache of doing all the text processing yourself is probably not worth it.
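For reference, the "read a line, modify it, write it, repeat" loop is only a few lines. This sketch is in Python rather than C#, and the function name `rewrite_lines` and its `transform` parameter are assumptions made for illustration.

```python
import io


def rewrite_lines(src, dst, transform):
    """Stream src to dst one line at a time, applying transform to each."""
    for line in src:
        dst.write(transform(line))


# Usage with in-memory streams; real files opened in text mode work the same.
source = io.StringIO("alpha\nbeta\ngamma\n")
target = io.StringIO()
rewrite_lines(source, target, lambda line: line.replace("beta", "BETA!"))
```

Only one line is held in memory at a time, and a line may change length freely because the output is written sequentially.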

Have you established an acceptable performance metric for your system?

Have you tried the normal text file processing approach and found it to exceed that metric before starting to look for a more efficient solution?

C# Dependency Injection with Adapter Pattern

Right now what I see is the following chain of dependencies:

client->controller->adapter->reader->controller

That does create a circular dependency.

You use Dependency Injection to achieve Dependency Inversion. That is, higher-level code does not depend on the implementation of lower-level code.

Looking at the chain, we can see that there is a piece of code, either the controller or the text reader, that is both high level and low level.

If the client really does need to use the text reader directly, then the adapter cannot use it as well, because you would need to inject it further up the chain. Instead, you would inject the adapter into the text reader, and then inject the text reader into the client.

But if we do that and still inject the controller into both the client and the text reader, are we not exposing functionality incorrectly? You would be saying both that the text reader is at the same level as the controller and that the controller is at a lower level than the text reader (because it is injected into it).

Finally, why does the client need to access the text reader used by the adapter? If something there is really needed, you can provide a pass-through method that doesn't expose the fact that a text reader sits somewhere underneath.

Check which functionality is lower level (the part that actually does the reading and writing) and which does the high-level work. Order the components from high level to low level, and inject the lower into the higher.

Without looking at the code, I imagine that the actual order would be

client->controller->adapter->text read/write

And that in fact the text reader/writer doesn't need to know about the controller because it is the one reading/writing directly from/to the device. Any information that the controller has that the text reader/writer needs will be passed through parameters.
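The one-directional chain can be sketched like this. The sketch is in Python rather than C#, and since the original code is not shown, every class and method name here (`TextReaderWriter`, `Adapter`, `Controller`, `Client`, `records`, and so on) is hypothetical.

```python
import io


class TextReaderWriter:
    """Lowest level: talks to the device and knows nothing above it."""

    def __init__(self, stream):
        self._stream = stream

    def read_text(self):
        self._stream.seek(0)
        return self._stream.read()

    def write_text(self, text):
        self._stream.seek(0)
        self._stream.truncate()
        self._stream.write(text)


class Adapter:
    """Translates between raw text and records."""

    def __init__(self, reader_writer):
        self._rw = reader_writer

    def load_records(self):
        return self._rw.read_text().splitlines()

    def save_records(self, records):
        self._rw.write_text("\n".join(records))


class Controller:
    def __init__(self, adapter):
        self._adapter = adapter

    def add_record(self, record):
        records = self._adapter.load_records()
        records.append(record)
        self._adapter.save_records(records)

    def records(self):
        # Pass-through: the client never sees the reader/writer itself.
        return self._adapter.load_records()


class Client:
    def __init__(self, controller):
        self._controller = controller

    def run(self):
        self._controller.add_record("hello")
        return self._controller.records()


# Composition: each lower-level object is injected into the one above it.
stream = io.StringIO("a\nb")
client = Client(Controller(Adapter(TextReaderWriter(stream))))
seen = client.run()
```

Nothing is injected downward, so the reader/writer receives whatever it needs from the controller as plain method parameters.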
