C# server side application 100 GB dataset + Garbage Collection

c# garbage-collection

If I have a server with 256 GB of RAM, can I create a C# application that has a 100 GB memory footprint?

I want to create a dictionary like

Dictionary<DateTime, Dictionary<string, Dictionary<string, DataClass>>>
First dictionary
DateTime: max 40 distinct values

Second dictionary
string: 2,000 distinct values

Third dictionary
string: 100 distinct values

Entries will be removed at the DateTime level (roughly every hour); until then, the various internal members are populated.
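
For concreteness, here is a rough sketch of the structure and the hourly eviction I have in mind (class and key names are just placeholders):

```csharp
using System;
using System.Collections.Generic;

// Placeholder for the real payload; its members get populated over the hour.
public class DataClass
{
    public double Value { get; set; }
    public string Payload { get; set; }
}

public class HourlyCache
{
    // DateTime (max ~40 slots) -> first key (~2,000) -> second key (~100) -> data
    private readonly Dictionary<DateTime, Dictionary<string, Dictionary<string, DataClass>>> _data =
        new Dictionary<DateTime, Dictionary<string, Dictionary<string, DataClass>>>();

    public void Add(DateTime slot, string outerKey, string innerKey, DataClass item)
    {
        Dictionary<string, Dictionary<string, DataClass>> byOuter;
        if (!_data.TryGetValue(slot, out byOuter))
            _data[slot] = byOuter = new Dictionary<string, Dictionary<string, DataClass>>();

        Dictionary<string, DataClass> byInner;
        if (!byOuter.TryGetValue(outerKey, out byInner))
            byOuter[outerKey] = byInner = new Dictionary<string, DataClass>();

        byInner[innerKey] = item;
    }

    // Called roughly every hour: the whole DateTime slot is dropped at once.
    public void EvictOlderThan(DateTime cutoff)
    {
        var expired = new List<DateTime>();
        foreach (var slot in _data.Keys)
            if (slot < cutoff)
                expired.Add(slot);

        foreach (var slot in expired)
            _data.Remove(slot);
    }
}
```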

With Java, I am told that even if the server has a lot of memory, the JVM will stall because of GC.

Best Answer

CLR Limits

The 32-bit CLR is limited to around 2 GB of memory per process.

With 64-bit and Windows Enterprise edition, you can get up to 8 TB of virtual memory:

A 64-bit process running on a 64-bit machine can acquire up to 8 TB of virtual memory. (See last paragraph.)
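
As a first sanity check, it is worth verifying at runtime that the process really is 64-bit and which GC flavor it runs under; a minimal sketch (server GC is typically enabled via `<gcServer enabled="true"/>` in the application config):

```csharp
using System;
using System.Runtime;

class RuntimeCheck
{
    static void Main()
    {
        // A 32-bit process will never get past roughly 2 GB, so check this first.
        Console.WriteLine("64-bit process:  {0}", Environment.Is64BitProcess);
        Console.WriteLine("64-bit OS:       {0}", Environment.Is64BitOperatingSystem);

        // Server GC generally copes better with large heaps than workstation GC.
        Console.WriteLine("Server GC:       {0}", GCSettings.IsServerGC);
        Console.WriteLine("GC latency mode: {0}", GCSettings.LatencyMode);

        // Managed heap size as currently reported by the runtime.
        Console.WriteLine("Managed heap:    {0:N0} bytes", GC.GetTotalMemory(false));
    }
}
```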

If I have to pick a particular platform (i.e. the CLR) because a 100 GB memory footprint stalls another one (the JVM) and I cannot use it for that reason, then I need to take a hard look at my software design, explore all the alternatives, and only then justify such a decision. (I discuss alternatives in the concluding section.)

Performance Analysis Considerations

In addition to CLR/process limits, your question has multiple aspects to it:

  • If there are other processes allocating a large amount of memory (say, close to 156 GB or more), then there will not be 100 GB left for your process. Rule no. 1: it depends on how much free memory your machine has when you attempt to allocate 100 GB.

  • Whether the machine or the CLR stalls is also a function of how the operating system treats such a large object. One example is the page file: if it is not suitably configured, the OS may swap too aggressively, or the page file may simply not be large enough. As a result, you may not only lose the performance you were hoping to gain by keeping such a large object in memory, the performance may end up worse than if the data were in a DB, purely because of OS behavior. Rule no. 2: it depends on how the OS is configured (or designed) to treat such a large memory footprint.

  • The performance evaluation criteria have to be defined in measurable terms. What does "stall" mean? Does it mean the CLR crashing, some functionality in your program (such as a user response) taking more than so many seconds, or your program having a certain impact on the machine or on other processes? Rule no. 3: performance criteria should ideally be measurably defined.

These rules lead to the overarching principle of performance evaluation: prototype and measure. Without measuring on the specific platform with your specific workload (i.e. the size of the objects you allocate and how the garbage collector treats them), it is very hard to predict whether this will work or "stall". In addition, even if your initial performance results do not meet your requirements, you may be able to tune the system to achieve more.
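
As a very rough illustration of "prototype and measure": something along these lines, run on the target machine, gives a first idea of allocation time and heap growth before committing to the design (the entry counts and the 1 KB byte[] payload are stand-ins for your real DataClass):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class AllocationPrototype
{
    static void Main()
    {
        const int slots = 40, outerKeys = 2000, innerKeys = 100;   // ~8 million entries

        long heapBefore = GC.GetTotalMemory(true);
        var sw = Stopwatch.StartNew();

        var data = new Dictionary<DateTime, Dictionary<string, Dictionary<string, byte[]>>>();
        var baseTime = DateTime.UtcNow;

        for (int s = 0; s < slots; s++)
        {
            var outer = new Dictionary<string, Dictionary<string, byte[]>>(outerKeys);
            for (int o = 0; o < outerKeys; o++)
            {
                var inner = new Dictionary<string, byte[]>(innerKeys);
                for (int i = 0; i < innerKeys; i++)
                    inner["k" + i] = new byte[1024];   // ~1 KB stand-in per entry
                outer["key" + o] = inner;
            }
            data[baseTime.AddHours(s)] = outer;
        }

        sw.Stop();
        long heapAfter = GC.GetTotalMemory(true);

        Console.WriteLine("Allocation time: {0}", sw.Elapsed);
        Console.WriteLine("Heap growth:     {0:N0} bytes", heapAfter - heapBefore);
        GC.KeepAlive(data);   // keep the structure alive through the measurement
    }
}
```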

Metrics

On the metrics front, the maximum number of entries based on your data is about 8 million (40 × 2,000 × 100). Assuming each entry in your dictionary averages 1 KB, the memory for the values alone is around 8 GB. However, since pretty much everything in the CLR is an object, each entry carries additional overhead (object headers, method table pointers) allocated alongside the data. So if instead of 8 GB your actual data is 100 GB, you are looking at a memory footprint noticeably larger than 100 GB. The difference between 8 GB and 100 GB is significant enough that I would do an in-depth analysis of how much memory is really needed.

Finally, if you intend to create such a large object, you might want to read about the Large Object Heap (LOH) in the .NET Framework. Depending on how you allocate the objects that go into the dictionary, they may or may not end up on the LOH (e.g. the CLR could consider them many small objects and allocate them on the Small Object Heap instead).
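
As a side note, if large arrays or buffers do end up on the LOH and it fragments, .NET Framework 4.5.1 and later at least let you request a one-off LOH compaction; a sketch (not something to call routinely):

```csharp
using System;
using System.Runtime;

static class LohMaintenance
{
    // Objects of 85,000 bytes or more are allocated on the Large Object Heap,
    // which is not compacted by default and can fragment over time.
    public static void CompactLargeObjectHeapOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect();   // the next blocking full collection performs the compaction
    }
}
```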

Concluding Remarks

If I were doing this, I would consider putting this data on disk or in a DB and measuring that, perhaps as the primary option. I see a few issues with the proposed model of keeping such data in memory:

  • What if the server crashes (or needs to be rebooted for maintenance)? Is it safe to lose the data?
  • If the schema needed to change and the data needed to be migrated from your current object schema to a new one, what would that transition look like (assuming the process cannot simply be shut down mid-way)?
  • If someone needed to inspect the data in memory (e.g. because it is suspected to be corrupted, or for a bug investigation), how would they look at it on the server? This is much easier if the data is on disk.

Once these are clarified, there may be a follow-up set of questions depending on the answers (e.g. if it is not OK to lose the data when the server crashes, then there is the question of backups, etc.).

With a classical DB, a file system, or a NoSQL design, these issues are easier to resolve, especially the continuity one (i.e. data is not lost if the server crashes). In addition, it makes the solution platform/language agnostic (i.e. it could be C#, Java, PHP, Python, whatever), which is indicative of a better design.