Data Structures – Most Efficient Way to Store Data (CSV, Flat Files, Tree Structure)


My data:

  • Size – 15+ Terabytes of text.
  • Data: Maximum of 1300 rows (always the same row headers), maximum of 6 columns.
  • The column headers will differ, but the values are all percentages between 0 and 100. The columns in one row always add up to 100%, so we can get rid of one column right away.
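
A minimal sketch of the "get rid of one column" idea: since each row sums to 100, the last value can be dropped when storing and rebuilt when reading. The function names `drop_last` and `restore_last` are made up for illustration.

```python
def drop_last(row):
    """Return the row without its final percentage (the part we store)."""
    return row[:-1]


def restore_last(stored, total=100):
    """Rebuild the full row: the missing value is whatever is left of the total."""
    return stored + [total - sum(stored)]


full_row = [20, 30, 50]
stored = drop_last(full_row)          # only two values are kept on disk
assert restore_last(stored) == full_row
```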

Can I keep this as a flat file system?
This will be a tree structure, so I don’t think there is any value in MySQL. Maybe NoSQL. Once the data is there, it never has to be updated. It’s just read.
Example:

Item, action1, action2, action3
A, 20, 40, 60
B, 20, 30, 50
...
ZZZ, 50, 50, 0

Usage:

The data will be read on a site. 500 users maximum (most likely under 100) and they will all be reading different files, rarely the same one at the same time.
When reading a file all the info is used, so there is no value in being able to only read one row. We will always load all info from the file. The info is only read, never updated.

Questions

  1. Best way to store this info?
  2. Each node (i.e. a CSV file) in the tree is in its own folder, with subfolders representing its child nodes, going all the way down the tree. Is this a massive waste of space? If it is, I'm not sure of a better way to do it.

I'm going to be creating 15+ TB of data, and it will very easily grow to 50 TB. I really want to make sure I do this correctly from the beginning. Any advice at all from someone experienced in this could save me a massive amount of time and future headaches, even hints that point me in the right direction. I was going to use something like MongoDB, but I am starting to think that keeping it as a flat file system won't be bad, since I always load the full file anyway and the folder structure keeps track of the tree. No one will need to jump to some node really far down the line.

This is a massive game tree (more like this). We start at the root. The root will be a folder called root and will contain a CSV file like the one above. This folder will have subfolders that represent the children, each also with a CSV file like the one above and folders representing their own children, and so on. So to get info you click down the tree. There will be code that just reads the subfolders and creates links to the next nodes. (One game tree will have 25-100k nodes, so that means 25-100k folders, which I think may be a waste of space.)
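
A rough sketch of what that reading code could look like, assuming every node folder holds one CSV file with a fixed name. The name `node.csv` and the function `load_node` are placeholders, not something already decided.

```python
import csv
from pathlib import Path


def load_node(node_dir, csv_name="node.csv"):
    """Return (rows, child_names) for one node folder.

    rows        -- every line of the node's CSV file, header included
    child_names -- names of the subfolders, i.e. the links to the next nodes
    """
    node_dir = Path(node_dir)
    with open(node_dir / csv_name, newline="") as f:
        # skipinitialspace handles the "A, 20, 40" style used in the example
        rows = list(csv.reader(f, skipinitialspace=True))
    children = sorted(p.name for p in node_dir.iterdir() if p.is_dir())
    return rows, children
```

Calling `load_node("root")` would give the root table plus the list of child folders to turn into links; clicking a link just calls it again one level down.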

Best Answer

File systems are very well optimised, so flat files might work very well. Since it is a game tree, I imagine each user is walking the tree, choosing a child node each time.

You haven't mentioned the request rate. If it is high, you will want an SSD, which is well suited to write-once, read-many workloads.

Given that a user always loads the entire file, it would be good to ensure that the file fits in a single block in the file system. The simplest way is to make a custom volume if necessary. This might also push you to use a binary encoding of the file rather than ASCII. Given that the row labels are constant you don't need to store them at all.
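
To make the binary-encoding idea concrete, here is a minimal sketch. It assumes the percentages are whole numbers, so each one fits in a single byte; the question does not say that, and fractional values would need a wider fixed-point field. The 3-byte header layout and the names `encode` and `decode` are invented for the example. Row labels are not stored, since they are constant.

```python
import struct

# Hypothetical header: row count (uint16) and column count (uint8), little-endian.
HEADER = struct.Struct("<HB")


def encode(rows):
    """Pack a table of integer percentages (0-100) into bytes, one byte per value."""
    n_cols = len(rows[0]) if rows else 0
    body = bytearray()
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same number of columns")
        if any(not 0 <= v <= 100 for v in row):
            raise ValueError("values must be integers between 0 and 100")
        body.extend(row)
    return HEADER.pack(len(rows), n_cols) + bytes(body)


def decode(blob):
    """Inverse of encode: return the table as a list of lists of ints."""
    n_rows, n_cols = HEADER.unpack_from(blob)
    body = blob[HEADER.size:]
    return [list(body[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]
```

Under these assumptions the largest file (1300 rows, 5 stored columns once the redundant one is dropped) comes to 6503 bytes. That is more than a typical 4 KiB block but fits in an 8 KiB one, which is where a custom volume with a larger block size would come in.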

Are the values in each file arbitrary, or can they be calculated from the values in the parent? If they can, it might be better to recalculate on demand: the instruction path length for I/O is quite long, so you can do quite a lot of computation in the time a read would take.