C# – Optimal way to implement this specific lookup table in C#

cdatabasehashinglinqspeed

I want to create a lookup table for this data:

The "input variables" (what is used to "lookup") are 4 different doubles that can each take on 1 of 200 numbers (the numbers range from 1-1000 but there are only 200 possible numbers that each can be (these 200 possible numbers that each of them can be are known to me)) the doubles are all 2 decimal places. If any one of the four were changed it would slightly change the output variables. There is also 1 integer (enum really) that can take on a value from 1-5.

There is a condition on 3 of the input variables that (1/x + 1/y + 1/z must be less than 1.02). Could this be used in a hashing algorithm?

The "output variables" (what is returned) are ~30 doubles (mostly 2 decimal places but one has 10 decimal places). (will be packaged in an object) ranging from 1 to 1000 (2 decimal places).

I expect there to be ~150 million records.

Should I use a big dictionary and load it into memory then I start the program?

Would a Database and LINQ be best?

Can I use Trees or Hashing in some way to speed it up?

I've never had to make a LUT this big before, where speed is a major factor.

Clarification: Because of the condition that (1/x + 1/y + 1/z must be less than 1.02, see above) there is only ~150 million combinations of input variables. Not ~(200)^4.

Update:

I have looked at some stats (min and max observed values and discovered some relationships) for my input variables and have found that if we call the 4 input doubles A, B, C, D: A and B have ~200 possible values each,
C has ~50 possible values, and
D has ~120 possible values

Of these, there are several relationships that mean that there are only ~27 million combinations of these rather than the ~150 million I originally thought. So there will be ~27 million records in the LUT. There is also definitely a relationship that I haven't been able to figure out between (A, B, C) and D which will also bring down the number of combinations.

Would it be optimal to run the LUT from RAM now that I have reduced the entries from 150 to 27 million (and probably lower)?

Now that it's down to 27 million would storing them as ints by multiplying them by 100 (2 decimal places) still be optimal?

As DocBrown suggested, I should store the doubles as ints (multiply by 100 because they have 2 decimal places) and then combine the 5 different ints (4 doubles (see above) and 1 enum (value:1-5)) into a key for the LUT.

How to do this such that I will have a unique value for each combination of my 5 input variables (the 5 ints) AND this method is open to expansion of each of the input variables, i.e. should I need to expand the double C's combinations to 70 instead of 50 I will need unique key values for the new entries that are result of the expanded number of total combinations of the input variables.

Best Answer

Doubles with 2 decimal places can easily be stored in an integer (just multiply each one by 100). If you map your 5 input variables to an index from 0 to 200 (or 1-5 for the fifth), you will need at most 5 bytes to store them all. For your output records, from what you write I estimate you will need around 4 bytes per value, maybe 8 for the last one, giving you a total of 29*4+8=124 bytes. Add the former 5 bytes, and add some internal overhead, you will need roundabout 150 bytes per record in total. Multiply this with 150 million records shows that you will need at least 22GB to hold the full data in memory. And that won't differ if you are using a hash structure or a tree structure.

If you have a 64 bit machine for your processing at hand with so much main memory, you can try to handle this without a database, but if you are using a typical standard PC, I would tend to use a database for that (at least, nowadays, in 5 or ten years when the "standard PC" comes with >64GB RAM the situation may be different). And don't use the doubles directly for indexing, map them first to integers in the 0,...,200 interval, combine them into a 5 byte integer and use that value as an indexed key.

Related Topic