Data Storage Optimization – Efficient Methods for Storing and Querying Millions of Objects

Tags: c#, dictionary, optimization

This is basically a logging/counting application that counts the number of packets, the type of each packet, etc. on a P2P chat network. This works out to roughly 4-6 million packets in a 5-minute period. Because I only take a "snapshot" of this information, I remove packets older than 5 minutes once every 5 minutes. So the maximum number of items that will be in this collection is 10 to 12 million.

Because I need to make 300 connections to different superpeers, each packet may be inserted at least 300 times (which is probably why holding this data in memory is the only reasonable option).

Currently, I am using a Dictionary to store this information. But because of the large number of items I'm trying to store, I run into issues with the large object heap, and memory usage grows continuously over time.

Dictionary<ulong, Packet>

public class Packet
{
    public ushort RequesterPort;  // port of the requesting peer
    public bool IsSearch;         // true if this is a search packet
    public string SearchText;     // search term, when IsSearch is set
    public bool Flagged;
    public byte PacketType;
    public DateTime TimeStamp;    // used to expire packets after 5 minutes
}

I have tried using MySQL, but it was not able to keep up with the amount of data I need to insert (while checking to make sure each packet was not a duplicate), even while using transactions.

I tried MongoDB, but its CPU usage was insane and it could not keep up either.

My main issue arises every 5 minutes, when I remove all packets older than 5 minutes and take a "snapshot" of the data. I use LINQ queries to count the number of packets of a given packet type. I also call a Distinct() query on the data, where I strip the 4 bytes of IP address out of each KeyValuePair's key, combine it with the RequesterPort value in the KeyValuePair's Value, and use that to get a distinct count of peers across all the packets.
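A minimal sketch of what those snapshot queries might look like, assuming the dictionary is named packets and that the low 4 bytes of the key hold the peer's IP address (the key layout and all names here are my guesses, not from the original post):

using System;
using System.Collections.Generic;
using System.Linq;

static class Snapshot
{
    public static void Take(Dictionary<ulong, Packet> packets, byte packetType)
    {
        // Count packets of one particular type.
        int typeCount = packets.Values.Count(p => p.PacketType == packetType);

        // Distinct peers: pull the assumed 4 IP bytes out of each key and
        // combine them with RequesterPort into a single value, then count
        // the distinct combinations.
        int distinctPeers = packets
            .Select(kvp => ((kvp.Key & 0xFFFFFFFFUL) << 16) | kvp.Value.RequesterPort)
            .Distinct()
            .Count();

        Console.WriteLine("{0} packets of type {1}, {2} distinct peers",
                          typeCount, packetType, distinctPeers);
    }
}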

The application currently hovers around 1.1 GB of memory usage, and when a snapshot is taken, usage can as much as double.

Now, this wouldn't be an issue if I had an insane amount of RAM, but the VM this runs on is limited to 2 GB of RAM at the moment.

Is there any easy solution?

Best Answer

Instead of having one dictionary and searching that dictionary for entries that are too old, have 10 dictionaries. Every 30 seconds or so, create a new "current" dictionary and discard the oldest dictionary, with no searching at all.
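A minimal sketch of that rotation, assuming ten 30-second buckets covering the 5-minute window; the class and method names (PacketStore, Rotate, and so on) are illustrative, not from the answer:

using System.Collections.Generic;

class PacketStore
{
    const int BucketCount = 10;                  // 10 x 30 s = 5 min window
    readonly Dictionary<ulong, Packet>[] buckets;
    int current;                                 // index of the newest bucket

    public PacketStore()
    {
        buckets = new Dictionary<ulong, Packet>[BucketCount];
        for (int i = 0; i < BucketCount; i++)
            buckets[i] = new Dictionary<ulong, Packet>();
    }

    // Call from a 30-second timer: the oldest bucket is dropped wholesale,
    // with no per-entry scanning at all.
    public Dictionary<ulong, Packet> Rotate()
    {
        current = (current + 1) % BucketCount;
        var expired = buckets[current];          // now the oldest bucket
        buckets[current] = new Dictionary<ulong, Packet>();
        return expired;                          // caller can recycle its Packets
    }

    // Duplicate checks must consult every live bucket, not just the newest.
    public bool Contains(ulong key)
    {
        for (int i = 0; i < BucketCount; i++)
            if (buckets[i].ContainsKey(key))
                return true;
        return false;
    }

    public void Add(ulong key, Packet packet) => buckets[current][key] = packet;
}

Duplicate checks now touch up to ten dictionaries, but expiry becomes a constant-time bucket swap instead of a scan over millions of entries.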

Next, when you're discarding the oldest dictionary, put all the old objects onto a FILO queue (a stack) for later reuse. Instead of using "new" to create new objects, pull an old object off the FILO queue and use a method to reconstruct it (unless the queue of old objects is empty). This can avoid a lot of allocations and a lot of garbage-collection overhead.
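A minimal sketch of that recycling scheme, using a Stack<Packet> as the FILO queue; Rent, Return, and the specific field resets are my choices, not from the original answer:

using System;
using System.Collections.Generic;

static class PacketPool
{
    static readonly Stack<Packet> pool = new Stack<Packet>();

    // Reuse a recycled Packet if one is available, otherwise allocate.
    public static Packet Rent()
    {
        return pool.Count > 0 ? pool.Pop() : new Packet();
    }

    // Reset a discarded Packet and park it for reuse instead of
    // leaving it for the garbage collector.
    public static void Return(Packet p)
    {
        p.RequesterPort = 0;
        p.IsSearch = false;
        p.SearchText = null;
        p.Flagged = false;
        p.PacketType = 0;
        p.TimeStamp = default(DateTime);
        pool.Push(p);
    }
}

Each time Rotate() hands back an expired bucket, every Packet in it can be passed to Return() before the bucket itself is dropped.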
