Last time I tried to understand C4.5, I failed, but I have implemented a variant of ID3 - originally out of curiosity, though it eventually got used as part of an overkill multiple-dispatch code generator. It never deals with large data sets, though, which is just as well. You wouldn't do well to imitate most of what I did, with maybe a few exceptions, and of course I learned a bit from the mistakes.
I tend to think in terms of building a decision tree for an expert system, so I tend to use the following terms - sorry if that's confusing...
Column = Question ..... A question the expert system might ask
Row = Conclusion ...... A possible conclusion the expert system might reach
Cell = Answer ......... For that question and conclusion, the answer the
                        user is expected to give
Actually, in my case, I made the conclusion into another column - similar to a truth table for a logic gate. Row numbers were therefore just row numbers. This allowed me to handle XOR-style problems, which can't even be represented if the same conclusion cannot appear on several rows. I'm not sure if this is relevant to you or not. In any case, I'm ignoring this below - it doesn't really make a lot of difference until you look at the details of choosing which question to ask next. For data mining, you probably don't have a particular piece of information to treat as the target conclusion anyway - the "conclusion" is just whatever is left when you decide to stop asking questions.
So - for each decision tree node derived so far, you have a set of outstanding questions (columns) and a set of not-yet-eliminated conclusions (rows). That's what I did. The only point worth adding is I used bit-vectors.
IIRC, the C++ std::vector<bool> specialisation is typically implemented as a packed bit-vector (std::array<bool>, by contrast, stores each bool as a separate byte or more), but you're still reliant on the STL algorithms for set operations, which operate one item at a time. I used my own bit-vector class, which has been gradually built up over a period of time, and which uses bitwise operators on the underlying std::vector<CHUNK> (where CHUNK is an unsigned integer type, usually 32 bits wide).
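To make that concrete, here's a minimal sketch of that kind of chunked bit-vector - this is not my actual class, just an illustration under the assumptions above (32-bit chunks, bitwise operators for set operations). The names BitVector, set, test and so on are invented for the example:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a chunked bit-vector over std::vector<CHUNK>.
// Set operations work one 32-bit chunk at a time, rather than one
// element at a time as with the STL set algorithms.
class BitVector {
public:
    explicit BitVector(std::size_t bits)
        : bits_(bits), chunks_((bits + 31) / 32, 0u) {}

    void set(std::size_t i)        { chunks_[i / 32] |=  (1u << (i % 32)); }
    void reset(std::size_t i)      { chunks_[i / 32] &= ~(1u << (i % 32)); }
    bool test(std::size_t i) const { return (chunks_[i / 32] >> (i % 32)) & 1u; }

    // Set intersection: one bitwise AND per 32 members.
    BitVector operator&(const BitVector& other) const {
        BitVector result(bits_);
        for (std::size_t c = 0; c < chunks_.size(); ++c)
            result.chunks_[c] = chunks_[c] & other.chunks_[c];
        return result;
    }

    std::size_t count() const {
        std::size_t n = 0;
        for (std::uint32_t chunk : chunks_)
            n += __builtin_popcount(chunk); // std::popcount in C++20
        return n;
    }

private:
    std::size_t bits_;
    std::vector<std::uint32_t> chunks_;
};
```

The point of doing it this way is that intersecting the "outstanding questions" or "remaining rows" sets at each node costs one AND per 32 items instead of one comparison per item.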
There may be a better bit-vector option in C++11 or in Boost, and there must be some good libraries out there somewhere - there are plenty of kinds of programs where you end up working with sets of small unsigned integers. I just don't know much about them because I've been too lazy to switch from using my own.
However, bit-vectors are at their best when sets are mostly dense. In this case, the set of rows is the obvious problem. Only the root node of the decision tree will have a perfectly dense row-set. As you get further from the root, the row sets get sparser and sparser, with each question answered resulting in the set of rows being distributed between two or more disjoint next-node row sets.
So a simple sorted-array-of-row-numbers might be the best representation for these sets. However, it's also possible that a "sparse bit-vector" might be worthwhile. One possible implementation is a sorted array of pairs, where the first of each pair is the first row-ID of a block and the second is a fixed-size bit-vector for that block. For example, the row number 35 might be stored in block 32 (35 & ~(32 - 1)) at bit position 3 (35 & (32 - 1)). If you only store the pairs where the bit-vector is non-zero, this gives something between a sorted array of IDs and a simple bit-vector - it handles sparse sets reasonably well, especially when IDs tend to cluster closely together within a set.
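A rough sketch of that idea, just to pin down the bit-twiddling - the struct and member names here are invented, and insertion is kept naive (linear shuffling within the vector) since the point is only the representation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the sparse bit-vector described above:
// a sorted array of (block base row-ID, 32-bit block) pairs, where
// only non-zero blocks are stored.
struct SparseBits {
    struct Block {
        std::uint32_t base; // first row-ID covered; a multiple of 32
        std::uint32_t bits; // one bit per row in [base, base + 32)
    };
    std::vector<Block> blocks; // kept sorted by base

    void set(std::uint32_t row) {
        std::uint32_t base = row & ~31u; // e.g. 35 -> block 32
        std::uint32_t bit  = row &  31u; // e.g. 35 -> bit position 3
        auto it = std::lower_bound(blocks.begin(), blocks.end(), base,
            [](const Block& b, std::uint32_t v) { return b.base < v; });
        if (it == blocks.end() || it->base != base)
            it = blocks.insert(it, Block{base, 0u}); // new non-zero block
        it->bits |= (1u << bit);
    }

    bool test(std::uint32_t row) const {
        std::uint32_t base = row & ~31u;
        auto it = std::lower_bound(blocks.begin(), blocks.end(), base,
            [](const Block& b, std::uint32_t v) { return b.base < v; });
        if (it == blocks.end() || it->base != base)
            return false; // block absent, so no rows from it are present
        return (it->bits >> (row & 31u)) & 1u;
    }
};
```

When row-IDs cluster, many rows share a block, so this stays close to the density of a plain bit-vector while skipping the empty regions entirely.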
Also, it may be worthwhile using a class that can switch from a bitvector to a sorted-array representation when the size gets small enough. The extra complication, purely to benefit a few nodes near the root, is probably pointless though.
Anyway, however these sets are represented, as they refer back to a single constant "database", this saves a lot in data-copying and space-waste as the algorithm runs. But it's still worth looking at that "database".
I used an associative data structure, allowing me to look up a tuple of question-ID and conclusion-ID to get an answer-ID. That means I had a per-item overhead for the key (question-ID and conclusion-ID) and, in this case, B+-style tree overhead as well. The reason - basically habit. I have containers that are very flexible, and I tend to use them a lot because it saves on anticipating what capabilities I'll actually need later. There's a price for that, but that's just the old premature optimisation thing.
In your case, you're using a matrix - I assume a two-dimensional array indexed by question-ID and answer-ID.
The only way I can imagine my version being more efficient than yours is if most answers aren't known. In a matrix, you need a special not-known answer-ID for that, taking the same space as a known answer-ID. In an associative container, you simply exclude those cells.
Even so, a sorted array would be more efficient than my B+ tree based solution. You don't need to allow for efficient inserts, so the only necessary overhead is for the keys.
If you use two key fields (question and conclusion - column and row), that might be a problem (I don't really remember) - you maybe can't just keep one copy of the table in one sorted order. But if you use a single computed key along the lines of (row * num_columns) + column, you're basically implementing a two-dimensional sparse array anyway.
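A sketch of that computed-key scheme, with invented names and a deliberately simple sorted std::vector of (key, answer) pairs standing in for the sorted array - unknown cells are simply absent, which is the whole point versus a dense matrix:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the single-computed-key idea: each known cell
// is stored as a (key, answer-ID) pair sorted by key, where
// key = row * num_columns + column. Cells with unknown answers are
// simply not stored, unlike a dense matrix, which would need a
// reserved "not known" answer-ID occupying the same space.
struct SparseTable {
    std::uint32_t num_columns;
    std::vector<std::pair<std::uint32_t, std::uint32_t>> cells; // sorted by key

    void add(std::uint32_t row, std::uint32_t col, std::uint32_t answer) {
        std::uint32_t key = row * num_columns + col;
        auto it = std::lower_bound(cells.begin(), cells.end(),
                                   std::make_pair(key, 0u));
        cells.insert(it, {key, answer}); // keep the array sorted
    }

    // Returns the answer-ID, or -1 if the cell's answer is unknown.
    int lookup(std::uint32_t row, std::uint32_t col) const {
        std::uint32_t key = row * num_columns + col;
        auto it = std::lower_bound(cells.begin(), cells.end(),
                                   std::make_pair(key, 0u));
        if (it == cells.end() || it->first != key)
            return -1; // absent pair means unknown answer
        return static_cast<int>(it->second);
    }
};
```

Since the table is built once and then only read while the tree is derived, the lack of efficient inserts doesn't matter - binary search on the sorted array is all the lookup machinery you need.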
For me, the presence of unknown/undefined answers for a particular question means I'm not allowed to ask that question yet - and even that was just the theory I used back when I first implemented the algorithm. I never actually put that to use. There's a use I could put it to, but I never got around to it. For the record, in that multiple-dispatch code generator, one idea was to dispatch based on fields in the type. As the type itself is polymorphic, those fields may not even be there, so it's only valid to look at them once you've confirmed that they must be present.
If you don't have an application for unknown/undefined answers, your existing matrix is probably the best solution already.
So basically, that's it - I can't really offer any clearly better options, and what you're doing is probably already better than what I did. However, there are some trade-off possibilities that you might consider - assuming that's not premature (and possibly false) optimisation, of course.
The main trade-off issue relates to the efficiency of representations of sparse vs. dense sets of values, so it isn't really specific to C4.5 or decision-tree building. And a more "sophisticated" approach is often less efficient than a simple one that was chosen with care.