StackOverflow would be a somewhat better place for asking programming/algorithm related questions.
In any case, the implementation you must have read would be based on "tables". Here is how such implementation works:
- Initialize vector with size n, say n = 16
Address: 0xAAA0 to 0xAAB0 Memory reserved
- Insert 17 elements. First 16 inserted fine. Next element requires more memory.
STL Library: Allocate memory for 16 * 2 = 32. Copy 16 elements. (Actual time taken = 16 units). Insert the 17th element.
- Insert 16 more elements. First 15 inserted fine. Next element requires more space.
STL Library: Allocate memory for 16 * 2 * 2 = 64. Copy 32 elements. (Actual time taken = 32 units).
Insert the 33rd element.
- Insert 32 more elements. First 31 inserted fine. Next requires more space.
STL Library: Allocate memory for 16 * 2 * 2 * 2 = 128. Copy 64 elements. (Actual time taken = 64 units).
Insert the 65th element.
This implementation is O(1) for accessing and O(1) amortized for insertion. How? Over a very large number of operations, the total time of inserts would be:
Time = 2^0 (inserts) + 2^0 (copy) + 2^1 / 2 ( inserts ) + 2^1(copy) + 2^2/2 (inserts) + 2^2 (copy) ... .. + 2^n(copy)
- Total number of inserts = 2^n
Time = 2^0 + 2^0 + 2^0 + 2^1 + 2^1 + 2^2 + 2^2 = 1 + 2*2^0 + 2*2^1 +...+2*2^(n-1)
= 1 + 2*(2^n - 1)
Average time per insert = 2 units
- Total inserts = 2^n + 1
Time = 2^0 + 2^0 + 2^0 + 2^1 + 2^1 + 2^2 + 2^2 + 2^3 ... = 1 + 2*2^0 + 2*2^1 +...+2*2^(n-1) + 2^n
= 1 + 2*(2^n - 1) + 2^n
Average time per insert = 3 units
Its not a linked list- but I'm pretty this is what you read: 'increasing/decreasing' sizes. Decrease in size upon deletes is similar. When used size is less than 1/4, free up the rest of memory, free up half of memory. If you're using your own memory allocator, it shouldn't be too hard to free only what you want. But if you want to copy over as well, analysis would tell you deletes still remain O(1)
The approach I would recommend is to focus on the interface of your key-value store, so as to make it as clean as possible and as nonrestrictive as possible, meaning that it should allow maximum freedom to the callers, but also maximum freedom for choosing how to implement it.
Then, I would recommend that you provide an as bare as possible, and as clean as possible implementation, without any performance concerns whatsoever. To me it seems like unordered_map
should be your first choice, or perhaps map
if some kind of ordering of keys must be exposed by the interface.
So, first get it to work cleanly and minimally; then, put it to use in a real application; in doing so, you will find what issues you need to address on the interface; then, go ahead and address them. Most chances are that as a result of changing the interface, you will need to rewrite big parts of the implementation, so any time you have already invested on the first iteration of the implementation beyond the bare minimum amount of time necessary to get it to just barely work is time wasted.
Then, profile it, and see what needs to be improved in the implementation, without altering the interface. Or you may have your own ideas about how to improve the implementation, before you even profile. That's fine, but it is still no reason to work on these ideas at any earlier point in time.
You say you hope to do better than map
; there are two things that can be said about that:
a) you probably won't;
b) avoid premature optimization at all costs.
With respect to the implementation, your main issue appears to be memory allocation, since you seem to be concerned with how to structure your design in order to work around problems that you foresee that you are going to have with respect to memory allocation. The best way to address memory allocation concerns in C++ is by implementing a suitable memory allocation management, not by twisting and bending the design around them. You should consider yourself lucky that you are using C++, which allows you to do your own memory allocation management, as opposed to languages like Java and C#, where you are pretty much stuck with what the language runtime has to offer.
There are various ways of going about memory management in C++, and the ability to overload the new
operator may come in handy. A simplistic memory allocator for your project would preallocate a huge array of bytes and use it as a heap. (byte* heap
.) You would have a firstFreeByte
index, initialized to zero, which indicates the first free byte in the heap. When a request for N
bytes comes in, you return the address heap + firstFreeByte
and you add N
to firstFreeByte
. So, memory allocation becomes so fast and efficient that it becomes virtually no issue.
Of course, preallocating all of your memory may not be a good idea, so you may have to break your heap into banks which are allocated on demand, and keep serving allocation requests from the at-any-given-moment-newest bank.
Since your data are immutable, this is a good solution. It allows you to abandon the idea of variable length objects, and to have each Pair
contain a pointer to its data as it should, since the extra memory allocation for the data costs virtually nothing.
If you want to be able to discard objects from the heap, so as to be able to reclaim their memory, then things become more complicated: you will need to be using not pointers, but pointers to pointers, so that you can always move objects around in the heaps so as to reclaim the space of deleted objects. Everything becomes a bit slower due to the extra indirection, but everything is still lightning fast compared to using standard runtime library memory allocation routines.
But all this is of course really useless to be concerned with if you don't first build a straightforward, bare-minimal, working version of your database, and put it to use in a real application.
Best Answer
What you should worry about is ownership of the data and how long the pointers should/will be valid.
Using std::vector will allow you to be more relaxed about validity though you should still be careful as a reallocation will invalidate the pointer.
However much C++ code is littered with
&vector[0]
to get the pointer to the first element so it's very much accepted to grab a raw pointer to fill in data using a third party library.