I'm wondering what the best on-disk data structure is for storing immutable time-series data (99% of the data is truly immutable; the remaining 1% is metadata kept separate from the immutable data). I've been looking at log-structured merge-trees in particular because of their heavy use by Cassandra and similar systems.
Database – Immutable Data Structure For Time Series Data
data-structures, database, database-design, immutability, theory
Related Solutions
In my opinion, your rule is a good one (or at least it's not a bad one), but only because of the situation you are describing. I wouldn't say that I agree with it in all situations, so, from the standpoint of my inner pedant, I'd have to say your rule is technically too broad.
Typically you wouldn't define immutable objects unless they are essentially being used as data transfer objects (DTO), which means that they contain data properties but very little logic and no dependencies. If that is the case, as it seems it is here, I'd say you are safe to use the concrete types directly rather than interfaces.
I'm sure there will be some unit-testing purists who will disagree, but in my opinion, DTO classes can be safely excluded from unit-testing and dependency-injection requirements. There is no need to use a factory to create a DTO, since it has no dependencies. If everything creates the DTOs directly as needed, then there's really no way to inject a different type anyway, so there's no need for an interface. And since they contain no logic, there's nothing to unit-test. Even if they do contain some logic, as long as they have no dependencies, then it should be trivial to unit-test the logic, if necessary.
As such, I think that making a rule that all DTO classes shall not implement an interface, while potentially unnecessary, is not going to hurt your software design. Since you have this requirement that the data needs to be immutable, and you cannot enforce that via an interface, then I would say it's totally legitimate to establish that rule as a coding standard.
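The "immutable DTO with no interface" idea can be sketched in Python with a frozen dataclass (the class and field names here are illustrative, not from the answer). The point is that immutability is enforced on the concrete class itself, which is something an interface could never express:

```python
from dataclasses import dataclass

# Illustrative immutable DTO: frozen=True makes the dataclass reject
# attribute assignment after construction, so immutability is enforced
# by the concrete class itself (an interface could not guarantee this).
@dataclass(frozen=True)
class TradeRecord:
    symbol: str
    price: float
    quantity: int

t = TradeRecord("ACME", 20.0, 100)
try:
    t.price = 25.0                  # attempted mutation
except AttributeError as e:         # FrozenInstanceError subclasses AttributeError
    err = type(e).__name__
print(err)  # FrozenInstanceError
```

Since the class has no dependencies and no logic, there is nothing to mock and nothing to inject, which is exactly the argument for skipping the interface.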
The larger issue, though, is the need to strictly enforce a clean DTO layer. As long as your interface-less immutable classes only exist in the DTO layer, and your DTO layer remains free of logic and dependencies, then you will be safe. If you start mixing your layers, though, and you have interface-less classes that double as business-layer classes, then I think you will run into a great deal of trouble.
NoSQL
For your raw transactions, if the data you are getting is not required to follow a specific format, NoSQL may be a good way to store the data. If it is capable of being stored in a relational data model easily (tables and columns), there are significant speed advantages to using a relational database.
How to do efficient calculations
Aggregate based on a time period
One of the fastest things you can do is to aggregate your data over a time period, storing the results.
Let's assume that your real-time feeds arrive by the minute. That would be 1,440 data points per day, 10,080 per week, approximately 43,200 per month, or 525,600 per year. Now, cut that down to one record per day: 365 rows per year, a savings of 525,235 records. That is over half a million rows that you don't have to aggregate every time you need to perform a "real time" calculation. You will have to determine the smallest time period that yields a meaningful aggregation without being excessively fine- or coarse-grained.
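The row counts above can be checked with a few lines of arithmetic (assuming one data point per minute and a 365-day year):

```python
MINUTES_PER_DAY = 24 * 60           # one data point per minute
per_day = MINUTES_PER_DAY           # 1,440
per_week = per_day * 7              # 10,080
per_month = per_day * 30            # 43,200 for a 30-day month
per_year = per_day * 365            # 525,600

# Rolling up to one record per day leaves 365 rows per year:
savings = per_year - 365            # 525,235 rows you never re-aggregate
print(per_year, savings)            # 525600 525235
```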
A concrete example will help to illustrate this principle: stock market trades by ticker symbol. Every trade comes across the wire with the ticker symbol, the time of the trade, price per share, and quantity of shares traded. There may be other data in the feed, but this will suffice for the example. Knowing the high (max), low (min), and average (mean) of all the trades by day will show certain types of trends. It is not necessary to see the individual transactions when you are looking for a trend that is days, months, or even years in the making. Once you have aggregated that data for the previous day's raw transactions, you don't ever need to recalculate those items again. Store the aggregated data in another table. Then if you want to determine an average (mean) price for a specific ticker over a period of time (week, month, quarter, etc.), it is a fairly simple SQL statement with a date range.
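A minimal sketch of that daily roll-up, using an in-memory SQLite database; the table and column names are illustrative, not prescribed by the answer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trades (
    symbol TEXT, traded_at TEXT, price REAL, quantity INTEGER)""")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [("ACME", "2024-01-02 09:30", 20.00, 100),
     ("ACME", "2024-01-02 10:15", 21.50, 50),
     ("ACME", "2024-01-02 15:59", 19.75, 200)])

# Roll the raw trades up into one row per symbol per day and store it
# in a summary table; once yesterday is aggregated, it never needs
# recomputing, and trend queries hit this small table instead.
conn.execute("""CREATE TABLE daily_summary AS
    SELECT symbol,
           date(traded_at)  AS trade_date,
           MIN(price)       AS low,
           MAX(price)       AS high,
           AVG(price)       AS mean_price,
           COUNT(*)         AS trade_count
    FROM trades
    GROUP BY symbol, date(traded_at)""")

row = conn.execute(
    "SELECT low, high, trade_count FROM daily_summary").fetchone()
print(row)  # (19.75, 21.5, 3)
```

A week, month, or quarter average then becomes a simple `SELECT ... WHERE trade_date BETWEEN ? AND ?` over the summary table.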
Weighted aggregation
Another very fast method you can use is commonly known as weighted averages.
Simply put, this is a running aggregation where you take your aggregated value multiplied by a weight (often a total quantity), add your new value, and then divide by the weight plus the new weight. It is easier than it sounds.
If you have an average price per share for the day (going back to the stock market example above), you can average the price by transaction (sum(price) / number of transactions). This would be one method for performing the averaging.
To make the running weighted average come out equal to the standard average, you have to multiply the current average by the number of transactions that have already completed before folding in the new value.
The weighted average method has you keep a running average of the price and the number of transactions. When a new transaction comes in, the previous average price is multiplied by the number of transactions, the new transaction price is added, and the result is divided by the number of transactions + 1. If you had 10 transactions at an average price of $20 and a new trade comes in at $24, your weighted average would be (10 × $20 + $24) / 11 ≈ $20.36. When the next trade comes in at $25, it becomes (11 × $20.36 + $25) / 12 ≈ $20.75.
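The running update described above can be sketched as a small function (the name and the sample values are illustrative):

```python
def running_average(avg, count, new_price):
    """Update a running mean: multiply the old mean by the old count,
    add the new value, and divide by the new count."""
    return (avg * count + new_price) / (count + 1), count + 1

avg, n = 20.0, 10                          # 10 trades averaging $20
avg, n = running_average(avg, n, 24.0)     # new $24 trade arrives
print(round(avg, 2))                       # 20.36
avg, n = running_average(avg, n, 25.0)     # next trade at $25
print(round(avg, 2))                       # 20.75
```

Note that only two numbers (the current average and the count) need to be stored, so the update is O(1) per incoming transaction regardless of history size.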
Best Answer
I do not really see what immutability has to do with this.
You simply store the data in the normal way and choose not to update it.
Your problem seems to be how to deal with a high insert rate, which, unless you are Google, Amazon, or Facebook, will be easily handled by any modern database.