C++ – How to Increase the Size of a Memory Mapped File

cfilesmemory

I am maintaning a memory mapped file to store my tree like datastructure.

When I'm updating the datastructure ,I got this problem.
The file is limited on it's size and can't be too long or too small.

I have a methods like

 void mapfile_insert_record(RECORD* /* record*/);

 void mapfile_modify_record(RECORD* /* record*/);

Both operations could lead to exceed the space which is free on memory file.

How do I overcome this?
What strategy I should use.

calculate whether it requires to exceed the file as a pre-condition on both
methods.
Dynamically exceed it , for a example manage a timer and constantly polling file
for it's free avaliable size and then automatically extend it.

Any ideas or patterns to overcome this problem?

Best Answer

Polling is not a good idea. It would only makes sense if the file is modified by other processes but even then it is better to extend only before insert/modify. Besides, concurrent access is dangerous, as you have to worry about synchronization.

Assuming that your file is never shrunk, you can cache the size of the file, so you only have to check the file size if your current request is outside of the last seen file size.

By the way, you describe version 2) as "Dynamically exceed it...". For me, version 1) also dynamically extends the file. The only difference is that 1) extends only on-demand, whereas 2) extends eagerly.

Unless there are some additional constraints that we do not know I would recommend to keep it simple and stick with version 1).

Related Solutions

C++ – Key / Value store development porting to modern C++

The approach I would recommend is to focus on the interface of your key-value store, so as to make it as clean as possible and as nonrestrictive as possible, meaning that it should allow maximum freedom to the callers, but also maximum freedom for choosing how to implement it.

Then, I would recommend that you provide an as bare as possible, and as clean as possible implementation, without any performance concerns whatsoever. To me it seems like unordered_map should be your first choice, or perhaps map if some kind of ordering of keys must be exposed by the interface.

So, first get it to work cleanly and minimally; then, put it to use in a real application; in doing so, you will find what issues you need to address on the interface; then, go ahead and address them. Most chances are that as a result of changing the interface, you will need to rewrite big parts of the implementation, so any time you have already invested on the first iteration of the implementation beyond the bare minimum amount of time necessary to get it to just barely work is time wasted.

Then, profile it, and see what needs to be improved in the implementation, without altering the interface. Or you may have your own ideas about how to improve the implementation, before you even profile. That's fine, but it is still no reason to work on these ideas at any earlier point in time.

You say you hope to do better than map; there are two things that can be said about that:

a) you probably won't;

b) avoid premature optimization at all costs.

With respect to the implementation, your main issue appears to be memory allocation, since you seem to be concerned with how to structure your design in order to work around problems that you foresee that you are going to have with respect to memory allocation. The best way to address memory allocation concerns in C++ is by implementing a suitable memory allocation management, not by twisting and bending the design around them. You should consider yourself lucky that you are using C++, which allows you to do your own memory allocation management, as opposed to languages like Java and C#, where you are pretty much stuck with what the language runtime has to offer.

There are various ways of going about memory management in C++, and the ability to overload the new operator may come in handy. A simplistic memory allocator for your project would preallocate a huge array of bytes and use it as a heap. (byte* heap.) You would have a firstFreeByte index, initialized to zero, which indicates the first free byte in the heap. When a request for N bytes comes in, you return the address heap + firstFreeByte and you add N to firstFreeByte. So, memory allocation becomes so fast and efficient that it becomes virtually no issue.

Of course, preallocating all of your memory may not be a good idea, so you may have to break your heap into banks which are allocated on demand, and keep serving allocation requests from the at-any-given-moment-newest bank.

Since your data are immutable, this is a good solution. It allows you to abandon the idea of variable length objects, and to have each Pair contain a pointer to its data as it should, since the extra memory allocation for the data costs virtually nothing.

If you want to be able to discard objects from the heap, so as to be able to reclaim their memory, then things become more complicated: you will need to be using not pointers, but pointers to pointers, so that you can always move objects around in the heaps so as to reclaim the space of deleted objects. Everything becomes a bit slower due to the extra indirection, but everything is still lightning fast compared to using standard runtime library memory allocation routines.

But all this is of course really useless to be concerned with if you don't first build a straightforward, bare-minimal, working version of your database, and put it to use in a real application.

C++ Memory Management – Understanding Memory Segmentation: Stack, Heap, etc.

The stack and heap in C/C++ describe different mechanisms of memory allocation. They can also be called “automatic storage” and “free store”. If you allocate data on the free store/heap, you are responsible for managing the lifetime (calling delete or free()). This is mostly unrelated to memory pages.

Memory pages are a block of virtual addresses. The virtual address space of a process is created by the operating system by mapping pages into the address space.

A process uses different areas of the address space differently. One area will be the stack. Other areas will hold the contents of executable files and libraries. These files may have different segments, e.g. for executable code, for constants, and space for variables.

Here I've pulled the address space mappings of a Perl interpreter using pmap:

0000000000400000   1776K r-x-- perl
00000000007bb000      4K r---- perl
00000000007bc000     12K rw--- perl
00000000007bf000      4K rw---   [ anon ]
0000000001eff000   1192K rw---   [ anon ]
00007f00184b7000   4464K r---- locale-archive
00007f0018913000   1792K r-x-- libc-2.23.so
00007f0018ad3000   2048K ----- libc-2.23.so
00007f0018cd3000     16K r---- libc-2.23.so
00007f0018cd7000      8K rw--- libc-2.23.so
00007f0018cd9000     16K rw---   [ anon ]
00007f0018cdd000     36K r-x-- libcrypt-2.23.so
00007f0018ce6000   2044K ----- libcrypt-2.23.so
00007f0018ee5000      4K r---- libcrypt-2.23.so
00007f0018ee6000      4K rw--- libcrypt-2.23.so
00007f0018ee7000    184K rw---   [ anon ]
00007f0018f15000   1056K r-x-- libm-2.23.so
00007f001901d000   2044K ----- libm-2.23.so
00007f001921c000      4K r---- libm-2.23.so
00007f001921d000      4K rw--- libm-2.23.so
00007f001921e000     12K r-x-- libdl-2.23.so
00007f0019221000   2044K ----- libdl-2.23.so
00007f0019420000      4K r---- libdl-2.23.so
00007f0019421000      4K rw--- libdl-2.23.so
00007f0019422000     96K r-x-- libpthread-2.23.so
00007f001943a000   2044K ----- libpthread-2.23.so
00007f0019639000      4K r---- libpthread-2.23.so
00007f001963a000      4K rw--- libpthread-2.23.so
00007f001963b000     16K rw---   [ anon ]
00007f001963f000    152K r-x-- ld-2.23.so
00007f0019839000     20K rw---   [ anon ]
00007f0019864000      4K r---- ld-2.23.so
00007f0019865000      4K rw--- ld-2.23.so
00007f0019866000      4K rw---   [ anon ]
00007ffc0cd0a000    136K rw---   [ stack ]
00007ffc0cd4e000     12K r----   [ anon ]
00007ffc0cd51000      8K r-x--   [ anon ]
ffffffffff600000      4K r-x--   [ anon ]
 total            21284K

Note that the smallest size is 4K, that is the page size on my system.

We can see on the rightmost column which files (executables or libraries) were mapped into the address space at which offset. There are also a couple of special regions, such as [stack]. Some of the anonymous regions can be used as a free store/heap. There may be gaps between the mapped ranges of the address space. Trying to access memory in an unmapped range will cause a segfault.

Each of those regions consists of one or more pages. This is important, because pages can have access protections: the executables and libraries provide executable pages for code, read-only pages for constants, and read-write pages for variables. This is a security mechanism to avoid arbitrary data from being executed (though this doesn't matter very much for an interpreter). The pages for the heap and stack will be readable and writeable, but not executable.

An address range may be mapped, but the page for that address might not exist at the moment. Using such an address will trigger a page fault. The operating system can intercept the page fault and add the page. For example, not all pages of the 136K for the stack may exist when the process starts. Instead, the pages are added on demand. Or, a page may have been swapped out to disk. A page fault will cause that page to be loaded back into physical memory.

Best Answer

Related Solutions

C++ – Key / Value store development porting to modern C++

C++ Memory Management – Understanding Memory Segmentation: Stack, Heap, etc.

Related Topic