Last time I tried to understand C4.5, I failed, but I have implemented a variant of ID3 - originally out of curiosity, though it eventually got used as part of an overkill multiple-dispatch code generator. It never deals with large data sets, though, which is just as well. You wouldn't do well to imitate most of what I did, with maybe a few exceptions, and of course I learned a bit from the mistakes.
I tend to think in terms of building a decision tree for an expert system, so I tend to use the following terms - sorry if that's confusing...
Column = Question ..... A question the expert system might ask
Row = Conclusion ...... A possible conclusion the expert system might reach
Cell = Answer ......... For that question and conclusion, the answer the
                        user is expected to give
Actually, in my case, I made the conclusion into another column - similar to a truth table for a logic gate. Row numbers were therefore just row numbers. This allowed me to handle XOR-style problems, which can't even be represented if the same conclusion cannot appear on several rows. I'm not sure if this is relevant to you or not. In any case, I'm ignoring this below - it doesn't really make a lot of difference until you look at the details of choosing which question to ask next. For data mining, you probably don't have a particular piece of information to treat as the target conclusion anyway - the "conclusion" is just whatever is left when you decide to stop asking questions.
So - for each decision tree node derived so far, you have a set of outstanding questions (columns) and a set of not-yet-eliminated conclusions (rows). That's what I did. The only point worth adding is I used bit-vectors.
IIRC, the C++ std::vector<bool> specialisation is typically implemented as a packed bit-vector (std::array<bool>, by contrast, stores each bool as a separate byte or more), but you're still reliant on the STL algorithms for set operations, which operate one item at a time. I used my own bit-vector class, which has been gradually built up over a period of time, and which uses bitwise operators on the underlying std::vector<CHUNK> (where CHUNK is an unsigned integer type, usually 32 bits wide).
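To make that concrete, here's a minimal sketch of that kind of chunked bit-vector - this is not my actual class, just an illustration under the assumptions above (32-bit chunks, bitwise operators for set operations). The names BitVector, set, test and so on are invented for the example:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a chunked bit-vector over std::vector<CHUNK>.
// Set operations work one 32-bit chunk at a time, rather than one
// element at a time as with the STL set algorithms.
class BitVector {
public:
    explicit BitVector(std::size_t bits)
        : bits_(bits), chunks_((bits + 31) / 32, 0u) {}

    void set(std::size_t i)        { chunks_[i / 32] |=  (1u << (i % 32)); }
    void reset(std::size_t i)      { chunks_[i / 32] &= ~(1u << (i % 32)); }
    bool test(std::size_t i) const { return (chunks_[i / 32] >> (i % 32)) & 1u; }

    // Set intersection: one bitwise AND per 32 members.
    BitVector operator&(const BitVector& other) const {
        BitVector result(bits_);
        for (std::size_t c = 0; c < chunks_.size(); ++c)
            result.chunks_[c] = chunks_[c] & other.chunks_[c];
        return result;
    }

    std::size_t count() const {
        std::size_t n = 0;
        for (std::uint32_t chunk : chunks_)
            n += __builtin_popcount(chunk); // std::popcount in C++20
        return n;
    }

private:
    std::size_t bits_;
    std::vector<std::uint32_t> chunks_;
};
```

The point of doing it this way is that intersecting the "outstanding questions" or "remaining rows" sets at each node costs one AND per 32 items instead of one comparison per item.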
There may be a better bit-vector option in C++11 or in Boost, and there must be some good libraries out there somewhere - there are plenty of kinds of programs where you end up working with sets of small unsigned integers. I just don't know much about them because I've been too lazy to switch from using my own.
However, bit-vectors are at their best when sets are mostly dense. In this case, the set of rows is the obvious problem. Only the root node of the decision tree will have a perfectly dense row-set. As you get further from the root, the row sets get sparser and sparser, with each question answered resulting in the set of rows being distributed between two or more disjoint next-node row sets.
So a simple sorted-array-of-row-numbers might be the best representation for these sets. However, it's also possible that a "sparse bit-vector" might be worthwhile. One possible implementation is a sorted array of pairs, where the first of each pair is the first row-ID of a block and the second is a fixed-size bit-vector for that block. For example, the row number 35 might be stored in block 32 (35 & ~(32 - 1)) at bit position 3 (35 & (32 - 1)). If you only store the pairs where the bit-vector is non-zero, this gives something between a sorted array of IDs and a simple bit-vector - it handles sparse sets reasonably well, especially when IDs tend to cluster closely together within a set.
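A rough sketch of that idea, just to pin down the bit-twiddling - the struct and member names here are invented, and insertion is kept naive (linear shuffling within the vector) since the point is only the representation:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the sparse bit-vector described above:
// a sorted array of (block base row-ID, 32-bit block) pairs, where
// only non-zero blocks are stored.
struct SparseBits {
    struct Block {
        std::uint32_t base; // first row-ID covered; a multiple of 32
        std::uint32_t bits; // one bit per row in [base, base + 32)
    };
    std::vector<Block> blocks; // kept sorted by base

    void set(std::uint32_t row) {
        std::uint32_t base = row & ~31u; // e.g. 35 -> block 32
        std::uint32_t bit  = row &  31u; // e.g. 35 -> bit position 3
        auto it = std::lower_bound(blocks.begin(), blocks.end(), base,
            [](const Block& b, std::uint32_t v) { return b.base < v; });
        if (it == blocks.end() || it->base != base)
            it = blocks.insert(it, Block{base, 0u}); // new non-zero block
        it->bits |= (1u << bit);
    }

    bool test(std::uint32_t row) const {
        std::uint32_t base = row & ~31u;
        auto it = std::lower_bound(blocks.begin(), blocks.end(), base,
            [](const Block& b, std::uint32_t v) { return b.base < v; });
        if (it == blocks.end() || it->base != base)
            return false; // block absent, so no rows from it are present
        return (it->bits >> (row & 31u)) & 1u;
    }
};
```

When row-IDs cluster, many rows share a block, so this stays close to the density of a plain bit-vector while skipping the empty regions entirely.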
Also, it may be worthwhile using a class that can switch from a bitvector to a sorted-array representation when the size gets small enough. The extra complication, purely to benefit a few nodes near the root, is probably pointless though.
Anyway, however these sets are represented, as they refer back to a single constant "database", this saves a lot in data-copying and space-waste as the algorithm runs. But it's still worth looking at that "database".
I used an associative data structure, allowing me to look up a tuple of question-ID and conclusion-ID to get an answer-ID. That means I had a per-item overhead for the key (question-ID and conclusion-ID) and, in this case, B+-style tree overhead as well. The reason - basically habit. I have containers that are very flexible, and I tend to use them a lot because it saves on anticipating what capabilities I'll actually need later. There's a price for that, but that's just the old premature optimisation thing.
In your case, you're using a matrix - I assume a two-dimensional array indexed by question-ID and answer-ID.
The only way I can imagine my version being more efficient than yours is if most answers aren't known. In a matrix, you need a special not-known answer-ID for that, taking the same space as a known answer-ID. In an associative container, you simply exclude those cells.
Even so, a sorted array would be more efficient than my B+ tree based solution. You don't need to allow for efficient inserts, so the only necessary overhead is for the keys.
If you use two key fields (question and conclusion - column and row), that might be a problem (I don't really remember) - you maybe can't just keep one copy of the table in one sorted order. But if you use a single computed key along the lines of (row * num_columns) + column, you're basically implementing a two-dimensional sparse array anyway.
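A sketch of that computed-key scheme, with invented names and a deliberately simple sorted std::vector of (key, answer) pairs standing in for the sorted array - unknown cells are simply absent, which is the whole point versus a dense matrix:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of the single-computed-key idea: each known cell
// is stored as a (key, answer-ID) pair sorted by key, where
// key = row * num_columns + column. Cells with unknown answers are
// simply not stored, unlike a dense matrix, which would need a
// reserved "not known" answer-ID occupying the same space.
struct SparseTable {
    std::uint32_t num_columns;
    std::vector<std::pair<std::uint32_t, std::uint32_t>> cells; // sorted by key

    void add(std::uint32_t row, std::uint32_t col, std::uint32_t answer) {
        std::uint32_t key = row * num_columns + col;
        auto it = std::lower_bound(cells.begin(), cells.end(),
                                   std::make_pair(key, 0u));
        cells.insert(it, {key, answer}); // keep the array sorted
    }

    // Returns the answer-ID, or -1 if the cell's answer is unknown.
    int lookup(std::uint32_t row, std::uint32_t col) const {
        std::uint32_t key = row * num_columns + col;
        auto it = std::lower_bound(cells.begin(), cells.end(),
                                   std::make_pair(key, 0u));
        if (it == cells.end() || it->first != key)
            return -1; // absent pair means unknown answer
        return static_cast<int>(it->second);
    }
};
```

Since the table is built once and then only read while the tree is derived, the lack of efficient inserts doesn't matter - binary search on the sorted array is all the lookup machinery you need.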
For me, the presence of unknown/undefined answers for a particular question means I'm not allowed to ask that question yet - and even that was just the theory I used back when I first implemented the algorithm. I never actually put that to use. There's a use I could put it to, but I never got around to it. For the record, in that multiple-dispatch code generator, one idea was to dispatch based on fields in the type. As the type itself is polymorphic, those fields may not even be there, so it's only valid to look at them once you've confirmed that they must be present.
If you don't have an application for unknown/undefined answers, your existing matrix is probably the best solution already.
So basically, that's it - I can't really offer any clearly better options, and what you're doing is probably already better than what I did. However, there are some trade-off possibilities that you might consider - assuming that's not premature (and possibly false) optimisation, of course.
The main trade-off issue relates to the efficiency of representations of sparse vs. dense sets of values, so it isn't really specific to C4.5 or decision-tree building. And a more "sophisticated" approach is often less efficient than a simple one that was chosen with care.