I've been tasked with querying two large CSV files, each roughly 1 GB in size. The files contain related data: file one may contain a list of order numbers, order dates, etc., and file two may contain the order lines, stock codes, quantities, etc.
Other Restrictions:
- No access to a database, so I can't just bulk insert and query
- Must be able to add new queries at a later point
The kinds of queries I need to write are things like "sum all the lines grouped by stock code" or "how many orders were there between two dates".
My initial idea was to create two models representing the files, import the lot, and query via LINQ. However, this may cause an issue due to the size of the data (and I can't use a database for lazy loading, etc.). Querying the models may also be quite slow.
The other idea I had was to create an import routine that evaluates the data line by line, i.e., if I need to count how many stock codes there are in a file, I can simply query that one column on a line-by-line basis and count them. This would mean I would have to write a CSV importer and some sort of query class.
Any thoughts or ideas on the best way to handle this? Is there an existing pattern for it?
Best Answer
I don't know which of the other answers will work best for you. From an implementation standpoint they all make sense, but from a design perspective (which one will scale best, which will be most maintainable, etc.) it all depends on factors you may have left out of your question. I can think of two such factors:
1) the actual shape of the flat data you already have on hand (independently of its volume);
2) whether or not you have comprehensive knowledge of a bounded, predictable set of query criteria your application will need.
For instance, suppose one can make two further strong assumptions (which only you can verify, and beware, they are indeed very strong):
a) the answer to (2) is "yes";
and
b) the flat CSV is already sorted on order id > order line, with the order date strictly increasing along with the order id.
Then radarbob's answer might be the best: all you'd need is a fast line-by-line CSV reader implementing, e.g., IEnumerable<RawOrderLine>, with little code to write, easy maintenance, and no other dependencies.
E.g.:
(much hypothetical & untested code)
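First, the shared plumbing, as a minimal sketch. All names here (RawOrderLine, OrderCsv, the column layout) are assumptions for illustration; adjust them to your actual file format:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical record shape; the column layout is an assumption.
public class RawOrderLine
{
    public int OrderId { get; set; }
    public DateTime OrderDate { get; set; }
    public int LineNumber { get; set; }
    public string StockCode { get; set; }
    public int Quantity { get; set; }
}

public static class OrderCsv
{
    // Streams the file one line at a time; only the current record
    // is ever held in memory.
    public static IEnumerable<RawOrderLine> ReadLines(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line = reader.ReadLine(); // skip the header row
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split(',');
                yield return new RawOrderLine
                {
                    OrderId = int.Parse(parts[0]),
                    OrderDate = DateTime.Parse(parts[1]),
                    LineNumber = int.Parse(parts[2]),
                    StockCode = parts[3],
                    Quantity = int.Parse(parts[4])
                };
            }
        }
    }
}
```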
Get all orders for a given stock code and a given date:
API:
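A possible signature, building on the reader sketched above. Since the file is sorted by date, TakeWhile lets the scan stop early instead of reading the whole file:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static partial class OrderQueries
{
    public static IEnumerable<RawOrderLine> GetOrders(
        string path, string stockCode, DateTime date)
    {
        return OrderCsv.ReadLines(path)
            // Sorted input: once past the requested day, no later
            // line can match, so we can stop reading the file.
            .TakeWhile(l => l.OrderDate.Date <= date.Date)
            .Where(l => l.StockCode == stockCode
                     && l.OrderDate.Date == date.Date);
    }
}
```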
How to:
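For example (using the "Product9" sample data described further below):

```csharp
foreach (var l in OrderQueries.GetOrders(
    "orders.csv", "Product9", new DateTime(2016, 12, 29)))
{
    Console.WriteLine("{0} line {1}: {2} x {3}",
        l.OrderId, l.LineNumber, l.StockCode, l.Quantity);
}
```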
Get all orders for a given stock code falling in a given date range:
API:
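Again hypothetical; SkipWhile and TakeWhile exploit the date ordering so that only the relevant slice of the file is ever materialized:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static partial class OrderQueries
{
    public static IEnumerable<RawOrderLine> GetOrders(
        string path, string stockCode, DateTime from, DateTime to)
    {
        return OrderCsv.ReadLines(path)
            .SkipWhile(l => l.OrderDate < from)  // fast-forward to the range
            .TakeWhile(l => l.OrderDate <= to)   // stop once past its end
            .Where(l => l.StockCode == stockCode);
    }
}
```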
How to:
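For example:

```csharp
var lines = OrderQueries.GetOrders("orders.csv", "Product9",
    new DateTime(2016, 12, 29), new DateTime(2016, 12, 31));
```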
Get the sum of quantities, for a given stock code, over all orders falling in a given date range:
API:
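This one can simply reuse the range query above (still an untested sketch):

```csharp
using System;
using System.Linq;

public static partial class OrderQueries
{
    // Sum folds the stream without ever materializing it.
    public static int SumQuantities(
        string path, string stockCode, DateTime from, DateTime to)
    {
        return GetOrders(path, stockCode, from, to).Sum(l => l.Quantity);
    }
}
```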
How to:
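For example:

```csharp
int total = OrderQueries.SumQuantities("orders.csv", "Product9",
    new DateTime(2016, 12, 29), new DateTime(2016, 12, 31));
Console.WriteLine("Total quantity: {0}", total);
```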
Etc, etc.
For a quick and dirty full sample, the code below will generate exactly 3 million random orders into the future, spaced 1 to 2 minutes apart (cf. OrderDate), with 1 to 10 order lines each (of distinct product/quantity pairs), resulting in a ~500 MB file (in ASCII/ANSI encoding, anyway).
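A rough sketch of such a generator, matching the hypothetical column layout used by the reader above:

```csharp
using System;
using System.IO;

public static class SampleData
{
    public static void Generate(string path)
    {
        var rand = new Random();
        var date = new DateTime(2016, 1, 1);
        using (var writer = new StreamWriter(path))
        {
            writer.WriteLine("OrderId,OrderDate,LineNumber,StockCode,Quantity");
            for (int orderId = 1; orderId <= 3000000; orderId++)
            {
                // Order dates spaced 1 to 2 minutes apart...
                date = date.AddSeconds(rand.Next(60, 121));
                // ...with 1 to 10 lines per order, of distinct products.
                int lineCount = rand.Next(1, 11);
                int firstProduct = rand.Next(0, 100);
                for (int line = 1; line <= lineCount; line++)
                {
                    writer.WriteLine("{0},{1:yyyy-MM-dd HH:mm:ss},{2},Product{3},{4}",
                        orderId, date, line,
                        (firstProduct + line) % 100,
                        rand.Next(1, 10));
                }
            }
        }
    }
}
```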
The two sample queries (over the last 3 days of 2016, for "Product9") take only a couple of seconds on my end, on a rather sluggish machine.
(The main thing that is annoyingly slow is spitting out the random sample data in the first place.)
'Hope this helps.