C++ – the best way to store a table in C++

c++, data structures

I'm programming a decision tree in C++ using a slightly modified version of the C4.5 algorithm. Each node represents an attribute (a column of the data set) and has one child per possible value of that attribute.

My problem is how to store the training data set, bearing in mind that each node works on a subset of it, so I need a quick way to select only a subset of rows and columns.

The main goal is to do this in the most memory- and time-efficient way possible (in that order of priority).

The best way I have thought of is to have an array of arrays (or std::vector of vectors), or something like that, and for each node keep a list (array, vector, etc.) of the (column, row) pairs that are valid for that node.

I know there should be a better way to do this. Any suggestions?

UPDATE: What I need is something like this:

In the beginning I have this data:

Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False

But for the second node I just need this data:

Paris    4    5.0
New York 7    1.3
Paris    9    6.8

And for the third node:

Tokio    2    9.1
Tokio    0    8.4

But with a table of millions of records and up to hundreds of columns.

What I have in mind is to keep all the data in a single matrix, and for each node keep only the indices of its current columns and rows. Something like this:

Paris    4    5.0    True
New York 7    1.3    True
Tokio    2    9.1    False
Paris    9    6.8    True
Tokio    0    8.4    False

Node 2:

columns = [0,1,2]
rows = [0,1,3]

Node 3:

columns = [0,1,2]
rows = [2,4]

This way, in the worst-case scenario, I only have to waste

sizeof(int) * (number_of_columns + number_of_rows) * number_of_nodes

That is a lot less than having an independent data matrix for each node.
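
For illustration, this is roughly how I picture it in code (the names and the variant cell type are just for the example):

    #include <cstddef>
    #include <string>
    #include <variant>
    #include <vector>

    // One cell can hold a string, an int, a double or a bool.
    using Cell = std::variant<std::string, int, double, bool>;

    // The whole training set, stored exactly once.
    using Matrix = std::vector<std::vector<Cell>>;

    // Each node only stores indices into the shared matrix.
    struct NodeView {
        std::vector<std::size_t> columns;   // e.g. {0, 1, 2}
        std::vector<std::size_t> rows;      // e.g. {0, 1, 3}
    };

    // Read a cell of the node's subset without copying any data.
    const Cell& cell(const Matrix& data, const NodeView& view,
                     std::size_t r, std::size_t c) {
        return data[view.rows[r]][view.columns[c]];
    }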

Best Answer

Last time I tried to understand C4.5, I failed, but I have implemented a variant of ID3 - originally out of curiosity, but it eventually got used as part of an overkill multiple-dispatch code generator. It never deals with large data sets, though, which is probably just as well. You wouldn't do well to imitate most of what I did, with perhaps a few exceptions, and of course I learned a bit from the mistakes.

I tend to think in terms of building a decision tree for an expert system, so I tend to use the following terms - sorry if that's confusing...

Column = Question ..... A question the expert system might ask
Row    = Conclusion ... A possible conclusion the expert system might reach
Cell   = Answer ....... For the question and conclusion, what answer should
                        the user be expected to give

Actually, in my case, I made the conclusion into another column - similar to a truth table for a logic gate. Row numbers were therefore just row numbers. This allowed me to handle XOR-style problems, which can't even be represented if the same conclusion cannot appear on several rows. I'm not sure if this is relevant to you or not. In any case, I'm ignoring it below - it doesn't really make much difference until you look at the details of choosing which question to ask next. For data mining, you probably don't have a particular piece of information to treat as the target conclusion anyway - the "conclusion" is just whatever is left when you decide to stop asking questions.

So - for each decision tree node derived so far, you have a set of outstanding questions (columns) and a set of not-yet-eliminated conclusions (rows). That's what I did. The only point worth adding is that I used bit-vectors.

IIRC, the C++ std::vector<bool> specialisation is usually implemented as a bit-vector (and std::bitset is a fixed-size one), but you're still reliant on the STL algorithms for set operations, which operate one item at a time. I used my own bit-vector class, which has been gradually built up over a period of time, and which uses bitwise operators on the underlying std::vector<CHUNK> (where CHUNK is an unsigned integer type, usually 32 bits wide).
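
To give an idea of what I mean, here is a stripped-down sketch of that kind of hand-rolled bit-vector - not my actual class, just the general shape of it, with chunks fixed at 32 bits:

    #include <cstdint>
    #include <vector>

    // Simplified dynamic bit-vector: one bit per row/column ID,
    // with set operations done a whole 32-bit chunk at a time.
    class BitSet {
    public:
        explicit BitSet(std::size_t bits)
            : chunks_((bits + 31) / 32, 0u) {}

        void set(std::size_t i)        { chunks_[i / 32] |=  (1u << (i % 32)); }
        void clear(std::size_t i)      { chunks_[i / 32] &= ~(1u << (i % 32)); }
        bool test(std::size_t i) const { return (chunks_[i / 32] >> (i % 32)) & 1u; }

        // Intersection: keep only IDs present in both sets.
        // Both sets are assumed to have been built with the same capacity.
        BitSet& operator&=(const BitSet& other) {
            for (std::size_t k = 0; k < chunks_.size(); ++k)
                chunks_[k] &= other.chunks_[k];
            return *this;
        }

        // Difference: remove IDs present in the other set.
        BitSet& operator-=(const BitSet& other) {
            for (std::size_t k = 0; k < chunks_.size(); ++k)
                chunks_[k] &= ~other.chunks_[k];
            return *this;
        }

    private:
        std::vector<std::uint32_t> chunks_;
    };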

There may be a better bit-vector option in C++11 or in Boost (boost::dynamic_bitset, for instance), and there must be some good libraries out there somewhere - there are plenty of kinds of program where you end up working with sets of unsigned integers. I just don't know much about them because I've been too lazy to switch from using my own.

However, bit-vectors are at their best when sets are mostly dense. In this case, the set of rows is the obvious problem. Only the root node of the decision tree has a perfectly dense row-set. As you get further from the root, the row sets get sparser and sparser, with each question answered resulting in the set of rows being distributed between two or more disjoint next-node row sets.

So a simple sorted-array-of-row-numbers might be the best representation for these sets. However, it's also possible that a "sparse bit-vector" might be worthwhile. One possible implementation is a sorted array of pairs, where the first of each pair is the first row-ID of a block and the second is a fixed-size bitvector for that block. For example, the row number 35 would be stored in block 32 (35 & ~(32 - 1)) at bit position 3 (35 & (32 - 1)). If you only store the pairs whose bitvector is non-zero, this gives something between a sorted array of IDs and a simple bitvector - it handles sparse sets reasonably well, especially when IDs tend to cluster closely together.
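
As a rough sketch of that idea (the names are made up, blocks fixed at 32 bits):

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Sparse bit-vector: only blocks containing at least one ID are stored,
    // as (block base, 32-bit mask) pairs kept sorted by block base.
    class SparseBitSet {
    public:
        void insert(std::uint32_t id) {
            const std::uint32_t base = id & ~31u;   // e.g. 35 -> block 32
            const std::uint32_t bit  = id &  31u;   // e.g. 35 -> bit 3
            auto it = std::lower_bound(blocks_.begin(), blocks_.end(), base,
                [](const auto& b, std::uint32_t v) { return b.first < v; });
            if (it == blocks_.end() || it->first != base)
                it = blocks_.insert(it, {base, 0u});
            it->second |= (1u << bit);
        }

        bool contains(std::uint32_t id) const {
            const std::uint32_t base = id & ~31u;
            auto it = std::lower_bound(blocks_.begin(), blocks_.end(), base,
                [](const auto& b, std::uint32_t v) { return b.first < v; });
            return it != blocks_.end() && it->first == base
                && ((it->second >> (id & 31u)) & 1u) != 0;
        }

    private:
        // Sorted by block base; stored masks are always non-zero.
        std::vector<std::pair<std::uint32_t, std::uint32_t>> blocks_;
    };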

Also, it may be worthwhile using a class that can switch from a bitvector to a sorted-array representation when the size gets small enough. The extra complication, purely to benefit a few nodes near the root, is probably pointless though.

Anyway, however these sets are represented, as they refer back to a single constant "database", this saves a lot in data-copying and space-waste as the algorithm runs. But it's still worth looking at that "database".

I used an associative data structure, allowing me to look up a tuple of question-ID and conclusion-ID to get an answer-ID. That means I had a per-item overhead for the key (question-ID and conclusion-ID), and in this case B+-tree-style overhead as well. The reason - basically habit. I have containers that are very flexible, and I tend to use them a lot because it saves on anticipating what capabilities I'll actually need later. There's a price for that, but that's just the old premature-optimisation thing.

In your case, you're using a matrix - I assume a two-dimensional array indexed by question-ID and answer-ID.

The only way I can imagine my version being more efficient than yours is if most answers aren't known. In a matrix, you need a special not-known answer-ID for that, taking the same space as a known answer-ID. In an associative container, you simply exclude those entries.

Even so, a sorted array would be more efficient than my B+ tree based solution. You don't need to allow for efficient inserts, so the only necessary overhead is for the keys.

If you use two key fields (question and conclusion, i.e. column and row), that might be a problem (I don't really remember) - you maybe can't just keep one copy of the table in one sorted order. But if you use a single computed key along the lines of (row * num_columns) + column, you're basically implementing a two-dimensional sparse array anyway.
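
As a rough sketch of that single-key idea - assuming a read-only table where answers are plain ints and unknown cells are simply absent:

    #include <algorithm>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    // Read-only sparse matrix keyed by (row * num_columns + column).
    // Unknown answers are simply absent - no sentinel value needed.
    class SparseAnswers {
    public:
        SparseAnswers(std::vector<std::pair<std::uint64_t, int>> entries,
                      std::uint64_t num_columns)
            : entries_(std::move(entries)), num_columns_(num_columns) {
            std::sort(entries_.begin(), entries_.end());
        }

        // Returns the answer-ID if the (row, column) cell is known.
        std::optional<int> lookup(std::uint64_t row, std::uint64_t column) const {
            const std::uint64_t key = row * num_columns_ + column;
            auto it = std::lower_bound(entries_.begin(), entries_.end(),
                                       std::make_pair(key, 0),
                                       [](const auto& a, const auto& b) {
                                           return a.first < b.first;
                                       });
            if (it != entries_.end() && it->first == key) return it->second;
            return std::nullopt;
        }

    private:
        std::vector<std::pair<std::uint64_t, int>> entries_;  // sorted by key
        std::uint64_t num_columns_;
    };

Since the table never changes while the tree is being built, a sort once up front plus binary search per lookup keeps the only per-item overhead down to the key itself.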

For me, the presence of unknown/undefined answers for a particular question means I'm not allowed to ask that question yet - and even that was just the theory I used back when I first implemented the algorithm. I never actually put that to use. There's a use I could put it to, but I never got around to it. For the record, in that multiple-dispatch code generator, one idea was to dispatch based on fields in the type. As the type itself is polymorphic, those fields may not even be there, so it's only valid to look at them once you've confirmed that they must be present.

If you don't have an application for unknown/undefined answers, your existing matrix is probably the best solution already.

So basically, that's it - I can't really offer any clearly better options, and what you're doing is probably already better than what I did. However, there are some trade-off possibilities that you might consider - assuming that's not premature (and possibly false) optimisation, of course.

The main trade-off issue relates to the efficiency of representations of sparse vs. dense sets of values, so it isn't really specific to C4.5 or decision-tree building. And a more "sophisticated" approach is often less efficient than a simple one that was chosen with care.
