Last time I tried to understand C4.5, I failed, but I have implemented a variant of ID3 - originally out of curiosity, though it eventually got used as part of an overkill multiple-dispatch code generator. It never deals with large data sets, though, which is just as well. You wouldn't do well to imitate most of what I did, with maybe a few exceptions, and of course I learned a bit from the mistakes.
I tend to think in terms of building a decision tree for an expert system, so I use the following terms - sorry if that's confusing...
Column = Question ..... A question the expert system might ask
Row = Conclusion ...... A possible conclusion the expert system might reach
Cell = Answer ......... For that question and conclusion, the answer the user would be expected to give
Actually, in my case, I made the conclusion into another column - similar to a truth table for a logic gate - so row numbers were just row numbers. This allowed me to handle XOR-style problems, which can't even be represented if the same conclusion cannot appear on several rows. I'm not sure whether this is relevant to you, and in any case I'm ignoring it below - it doesn't make much difference until you look at the details of choosing which question to ask next. For data mining, you probably don't have a particular piece of information to treat as the target conclusion anyway - the "conclusion" is just whatever is left when you decide to stop asking questions.
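To make that concrete, here's a hypothetical XOR truth table in that layout - the array name and code are purely illustrative, not from my implementation:

```cpp
#include <cassert>

// Hypothetical layout: the conclusion is just another column of the table,
// so the same conclusion value can legitimately appear on several rows.
// Each row of an XOR gate's table: {answer A, answer B, conclusion}.
const int kXorTable[4][3] = {
    {0, 0, 0},
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 0},
};

// The conclusion column is A XOR B on every row - and the conclusion "1"
// appears on two rows, which a one-conclusion-per-row scheme can't express.
inline bool table_is_consistent() {
    for (const auto& row : kXorTable)
        if (row[2] != (row[0] ^ row[1])) return false;
    return true;
}
```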
So - for each decision tree node derived so far, you have a set of outstanding questions (columns) and a set of not-yet-eliminated conclusions (rows). That's what I did. The only point worth adding is I used bit-vectors.
IIRC, the C++ std::vector<bool> may be implemented as a bit-vector (and std::bitset always is), but you're still reliant on the STL algorithms for set operations, which operate one item at a time. I used my own bit-vector class, which has been gradually built up over a period of time, and which uses bitwise operators on an underlying std::vector<CHUNK> (where CHUNK is an unsigned int type, usually 32 bits wide).
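A minimal sketch of that kind of bit-vector class - the names are hypothetical, and this isn't my actual class, just the shape of the idea:

```cpp
#include <cstdint>
#include <vector>

// Sketch of a bit-vector supporting bulk set operations via bitwise
// operators on 32-bit chunks, rather than one element at a time.
class BitVec {
    std::vector<std::uint32_t> chunks_;  // CHUNK = uint32_t here
public:
    explicit BitVec(std::size_t bits = 0) : chunks_((bits + 31) / 32) {}
    void set(std::size_t i) {
        if (i / 32 >= chunks_.size()) chunks_.resize(i / 32 + 1);
        chunks_[i / 32] |= std::uint32_t(1) << (i % 32);
    }
    bool test(std::size_t i) const {
        return i / 32 < chunks_.size() &&
               ((chunks_[i / 32] >> (i % 32)) & 1u);
    }
    // Set intersection: 32 set-memberships per machine operation.
    BitVec& operator&=(const BitVec& o) {
        if (chunks_.size() > o.chunks_.size()) chunks_.resize(o.chunks_.size());
        for (std::size_t c = 0; c < chunks_.size(); ++c)
            chunks_[c] &= o.chunks_[c];
        return *this;
    }
};
```

Union and difference follow the same pattern with `|=` and `&= ~`.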
There may be a better bit-vector option in C++11 or in Boost (boost::dynamic_bitset, for instance), and there must be some good libraries out there somewhere - there are plenty of kinds of program where you end up working with sets of unsigned integers. I just don't know much about them because I've been too lazy to switch from using my own.
However, bit-vectors are at their best when sets are mostly dense. In this case, the set of rows is the obvious problem. Only the root node of the decision tree will have a perfectly dense row set. As you get further from the root, the row sets get sparser and sparser, with each question answered resulting in the set of rows being distributed between two or more disjoint next-node row sets.
So a simple sorted array of row numbers might be the best representation for these sets. However, it's also possible that a "sparse bit-vector" might be worthwhile. One possible implementation is a sorted array of pairs, where the first of each pair is the first row-ID of a block and the second is a fixed-size bit-vector for that block. For example, the row number 35 might be stored in block 32 (35 & ~(32 - 1)) at bit position 3 (35 & (32 - 1)). If you only store the pairs where the bit-vector is non-zero, this gives something between a sorted array of IDs and a simple bit-vector - it handles sparse sets reasonably well, especially when IDs tend to cluster closely together within sets.
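A sketch of that sparse bit-vector idea, assuming 32-row blocks (the names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of the sorted-array-of-blocks idea: each pair is (block base
// row-ID, 32-bit mask). Only blocks with a non-zero mask are stored.
class SparseBitVec {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> blocks_;
public:
    void set(std::uint32_t row) {
        std::uint32_t base = row & ~31u;  // e.g. 35 -> block 32
        std::uint32_t bit  = row &  31u;  // e.g. 35 -> bit position 3
        auto it = std::lower_bound(blocks_.begin(), blocks_.end(),
                                   std::make_pair(base, 0u));
        if (it == blocks_.end() || it->first != base)
            it = blocks_.insert(it, {base, 0u});
        it->second |= 1u << bit;
    }
    bool test(std::uint32_t row) const {
        std::uint32_t base = row & ~31u;
        auto it = std::lower_bound(blocks_.begin(), blocks_.end(),
                                   std::make_pair(base, 0u));
        return it != blocks_.end() && it->first == base &&
               ((it->second >> (row & 31u)) & 1u);
    }
};
```

When IDs cluster, many rows share a block, so the per-ID overhead approaches one bit; when they don't, it degrades gracefully towards a sorted array of (ID, mask) pairs.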
Also, it may be worthwhile using a class that can switch from a bitvector to a sorted-array representation when the size gets small enough. The extra complication, purely to benefit a few nodes near the root, is probably pointless though.
Anyway, however these sets are represented, the fact that they refer back to a single constant "database" saves a lot of data-copying and wasted space as the algorithm runs. But it's still worth looking at that "database".
I used an associative data structure, allowing me to look up a tuple of question-ID and conclusion-ID to get an answer-ID. That means I had a per-item overhead for the key (question-ID and conclusion-ID) and, in this case, B+-tree-style overhead as well. The reason? Basically habit. I have containers that are very flexible, and I tend to use them a lot because it saves on anticipating which capabilities I'll actually need later. There's a price for that, but that's just the old premature optimisation thing.
In your case, you're using a matrix - I assume a two-dimensional array indexed by question-ID and answer-ID.
The only way I can imagine my version being more efficient than yours is if most answers aren't known. In a matrix, you need a special not-known answer-ID for that, taking the same space as a known answer-ID. In an associative container, you simply exclude those entries.
Even so, a sorted array would be more efficient than my B+ tree based solution. You don't need to allow for efficient inserts, so the only necessary overhead is for the keys.
If you use two key fields (question and conclusion - row and column), that might be a problem (I don't really remember) - you maybe can't just keep one copy of the table in one sorted order. But if you use a single computed key along the lines of (row * num_columns) + column, you're basically implementing a two-dimensional sparse array anyway.
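A sketch of that computed-key approach - a sorted array searched with a binary search, where an absent key means the answer is unknown (the names and the -1 sentinel are my illustrative choices):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: a sparse 2-D table stored as (computed key, answer-ID) entries,
// kept sorted by key = row * num_columns + column. Unknown answers are
// simply absent, so the only per-cell overhead is the key itself.
struct SparseTable {
    std::uint32_t num_columns;
    std::vector<std::pair<std::uint32_t, int>> entries;  // sorted by key

    // Returns the answer-ID, or -1 if that cell is unknown.
    int lookup(std::uint32_t row, std::uint32_t column) const {
        std::uint32_t key = row * num_columns + column;
        auto it = std::lower_bound(
            entries.begin(), entries.end(), key,
            [](const std::pair<std::uint32_t, int>& e, std::uint32_t k) {
                return e.first < k;
            });
        return (it != entries.end() && it->first == key) ? it->second : -1;
    }
};
```

Since the table is built once and never sees inserts afterwards, the sorted array needs no tree nodes, pointers or rebalancing - just the keys.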
For me, the presence of unknown/undefined answers for a particular question means I'm not allowed to ask that question yet - though even that was just the theory I had back when I first implemented the algorithm; I never actually put it to use. For the record, in that multiple-dispatch code generator, one idea was to dispatch based on fields in the type. As the type itself is polymorphic, those fields may not even be there, so it's only valid to look at them once you've confirmed that they must be present.
If you don't have an application for unknown/undefined answers, your existing matrix is probably the best solution already.
So basically, that's it - I can't really offer any clearly better options, and what you're doing is probably already better than what I did. However, there are some trade-off possibilities that you might consider - assuming that's not premature (and possibly false) optimisation, of course.
The main trade-off issue relates to the efficiency of representations of sparse vs. dense sets of values, so it isn't really specific to C4.5 or decision-tree building. And a more "sophisticated" approach is often less efficient than a simple one that was chosen with care.
Best Answer
Data-Oriented Mindset
Data-oriented design doesn't mean applying SoAs everywhere. It simply means designing architectures with a predominant focus on data representation - specifically with a focus on efficient memory layout and memory access.
That could possibly lead to SoA representations where appropriate, like so:
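For example, a sketch of an SoA layout for a set of spheres (the names are illustrative):

```cpp
#include <vector>

// Structure-of-arrays: each field is its own contiguous array, so a loop
// over just the radii (or just the x components) streams through memory
// with none of the other fields polluting the cache lines.
struct BallsSoA {
    std::vector<float> x, y, z;  // sphere center components
    std::vector<float> r;        // radii
};

// Vertical loop: touches only the radius array.
float total_radius(const BallsSoA& balls) {
    float sum = 0.f;
    for (float radius : balls.r) sum += radius;
    return sum;
}
```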
This is often suitable for vertical, loopy logic that doesn't process a sphere center's vector components and radius simultaneously (the four fields aren't simultaneously hot), but instead one at a time (a loop through the radii, another 3 loops through the individual components of the sphere centers).
In other cases it might be more appropriate to use an AoS, if the fields are frequently accessed together (if your loopy logic is iterating through all the fields of balls rather than through them individually) and/or if random access of a whole ball is needed:
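A sketch of the AoS alternative (again, names illustrative):

```cpp
#include <vector>

// Array-of-structures: all four fields of one ball sit together in memory,
// which is what you want when a loop (or a random access) needs the whole
// ball at once.
struct Ball { float x, y, z, r; };

using BallsAoS = std::vector<Ball>;

// Horizontal loop: every field of each ball is hot at the same time.
bool contains_origin(const BallsAoS& balls) {
    for (const Ball& b : balls)
        if (b.x * b.x + b.y * b.y + b.z * b.z <= b.r * b.r)
            return true;
    return false;
}
```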
In still other cases it might be suitable to use a hybrid which balances both benefits:
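A sketch of such a hybrid (AoSoA) layout, assuming an illustrative block size of 8:

```cpp
#include <vector>

// Hybrid (AoSoA): fixed-size SoA blocks stored in an array. Each block
// keeps 8 balls' fields in small contiguous runs - friendly to SIMD and to
// per-field loops - while a whole ball is still never far away in memory.
// The block size of 8 is an illustrative choice.
struct BallBlock {
    float x[8], y[8], z[8], r[8];
};

// Ball i lives in block i / 8, lane i % 8.
using BallsAoSoA = std::vector<BallBlock>;
```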
You might even compress the size of a ball to half using half-floats, to fit more ball fields into a cache line/page.
Perhaps even the radius is not accessed nearly as often as the sphere center (perhaps your codebase often treats them like points and only rarely as spheres, e.g.). In that case, you might further apply a hot/cold field-splitting technique.
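A sketch of that hot/cold split (names illustrative):

```cpp
#include <vector>

// Hot/cold field splitting: if most of the codebase treats these as points
// and only rarely as spheres, keep the frequently-touched center data dense
// and park the rarely-touched radii in a separate, parallel array.
struct Points {
    std::vector<float> x, y, z;  // hot: touched by almost every pass
};

struct Spheres {
    Points centers;              // hot data stays cache-dense
    std::vector<float> radii;    // cold: only loaded by sphere-specific code
};
```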
The key to a data-oriented design is to consider all of these kinds of representations early in making your design decisions, to not trap yourself into a sub-optimal representation with a public interface behind it.
It puts a spotlight on memory access patterns and accompanying layouts, making them a significantly stronger concern than usual. In a sense it may even somewhat tear down abstractions. I've found by applying this mindset more that I no longer look at std::deque, e.g., in terms of its algorithmic requirements quite as much as in terms of the aggregated-contiguous-blocks representation it has and how random access of it works at the memory level. It is somewhat putting a focus on implementation details, but implementation details which tend to have just as much or more of an impact on performance as the algorithmic complexity describing scalability.
Premature Optimization
A lot of the predominant focus of data-oriented design will appear, at least at a glance, as dangerously close to premature optimization. Experience often teaches us that such micro-optimizations are best applied in hindsight, and with a profiler in hand.
Yet perhaps a strong message to take from data-oriented design is to leave room for such optimizations. That's what a data-oriented mindset can help allow.
Granular Object-Oriented Design
A lot of data-oriented design discussions will pit themselves against classical notions of object-oriented programming. Yet I would offer a way of looking at this which is not quite so hardcore as to dismiss OOP entirely.
As an exaggerated example, imagine an object-oriented design mindset applied to a single pixel of an image.
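A sketch of what that gross design might look like (reconstructed purely as an illustration):

```cpp
// A deliberately bad, overly granular design - a class per pixel, each
// carrying a back pointer to its containing image. Purely illustrative.
class Image;

class Pixel {
public:
    explicit Pixel(Image* image = nullptr) : image_(image) {}
    unsigned char r() const { return rgba_[0]; }
    unsigned char g() const { return rgba_[1]; }
    unsigned char b() const { return rgba_[2]; }
    unsigned char a() const { return rgba_[3]; }
private:
    unsigned char rgba_[4] = {0, 0, 0, 0};
    Image* image_;  // back pointer so a pixel can reach its neighbours -
                    // multiplies the size of every single pixel
};
```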
Hopefully no one actually does this. To make the example really gross, I stored a back pointer to the image containing the pixel so that it can access neighboring pixels for image processing algorithms like blur.
The image back pointer immediately adds a glaring overhead, but even if we excluded it (making only the public interface of pixel provide operations that apply to a single pixel), we end up with a class just to represent a pixel.
Now, in a C++ context, there's nothing wrong with a class here in the immediate-overhead sense apart from that back pointer. Optimizing C++ compilers are great at taking all the structure we built and obliterating it down to smithereens.
The difficulty here is that we're modeling an encapsulated interface at too granular a level - the level of a single pixel. That leaves us trapped with this kind of granular design and data, with potentially a vast number of client dependencies coupling them to this Pixel interface.
Solution: obliterate away the object-oriented structure of a granular pixel, and start modeling your interfaces at a coarser level dealing with a bulk number of pixels (at the image level).
By modeling at the bulk image level, we have significantly more room to optimize. As a hardcore data-oriented example, we can represent large images as coalesced tiles of 16x16 pixels, sized to fit neatly into cache lines, allowing efficient access to vertically neighboring pixels with a typically-small stride (useful if we have a number of image-processing algorithms which need to access neighboring pixels in a vertical fashion).
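A sketch of such a tiled representation, assuming single-byte pixels (names and tile size are illustrative):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the coarser, image-level representation: pixels stored as
// 16x16 single-channel tiles, so vertically adjacent pixels within a tile
// are only 16 bytes apart instead of a whole image row apart.
struct TiledImage {
    int tiles_wide = 0, tiles_high = 0;
    std::vector<std::uint8_t> data;  // tiles_wide * tiles_high * 256 bytes

    std::uint8_t& at(int x, int y) {
        int tile = (y / 16) * tiles_wide + (x / 16);
        return data[tile * 256 + (y % 16) * 16 + (x % 16)];
    }
};
```

The point isn't this exact layout - it's that an image-level interface leaves us free to swap it in (or out) later without touching client code.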
Designing at a Coarser Level
The above example of modeling interfaces at an image level is kind of a no-brainer example as image processing is a very mature field that's been studied and optimized to death. Yet less obvious might be a particle in a particle emitter, a sprite vs. a collection of sprites, an edge in a graph of edges, or even a person vs. a collection of people.
The key to allowing data-oriented optimizations (in foresight or hindsight) is often going to boil down to designing interfaces at a much coarser level, in bulk. The idea of designing interfaces for single entities becomes replaced by designing for collections of entities with big operations that process them in bulk. This especially and immediately targets sequential access loops that need to access everything and cannot help but have linear complexity.
Data-oriented design often begins with the idea of coalescing data to form aggregates modeling data in bulk. A similar mindset echoes to the interface designs that accompany it.
This is the most valuable lesson I've taken from data-oriented design, since I'm not computer architecture-savvy enough to often find the most optimal memory layout for something on my first try. It becomes something I iterate towards with a profiler in hand (and sometimes with a few misses along the way where I failed to speed things up). Yet the interface design aspect of data-oriented design is what leaves me room to seek more and more efficient data representations.
The key is to design interfaces at a coarser level than we're usually tempted to do. This also often has side benefits like mitigating the dynamic dispatch overhead associated with virtual functions, function pointer calls, dylib calls and the inability for those to be inlined. The main idea to take out of all of this is to look at processing in a bulk fashion (when applicable).