I recently discovered (or rather realised how to use) Python's multiple inheritance, and am afraid I'm now using it in cases where it's not a good fit. I want to have some starting data source (NewsCacheDB, TwitterStream) that gets transformed in various ways (Vectorize, SelectKBest, SelectPercentile).
I found myself writing the following sort of code (Example 1); the actual code is a bit more complex, but the idea is the same. The point is that for ExperimentA and ExperimentB I can define exactly what self.data is just by relying on class inheritance. Is this really a useful way of achieving the desired behaviour?
I could also use decorators (Example 2). Using the decorators would be less code.
Which approach is preferable? I'm not looking for arguments of the "I like writing decorators better" kind, but rather arguments about
- readability
- maintainability
- testability
- pythonicity (yes it's a word).
EXAMPLE 1
class NewsCacheDB(object):
    """Play back cached news articles from a database"""
    def __init__(self):
        super(NewsCacheDB, self).__init__()

    @property
    def data(self):
        # set up access to the database
        while db.isalive():
            yield db.next()  # slight simplification here


class TwitterCacheDB(object):
    """Play back cached tweets from a database"""
    def __init__(self):
        super(TwitterCacheDB, self).__init__()

    @property
    def data(self):
        # set up access to the database
        while db.isalive():
            yield db.next()  # slight simplification here


class TwitterStream(object):
    """Read tweets from the live Twitter stream"""
    def __init__(self):
        super(TwitterStream, self).__init__()

    @property
    def data(self):
        # set up access to the live Twitter stream
        while stream.isalive():
            yield stream.next()


class Vectorize(object):
    """Turn raw data into numpy vectors"""
    def __init__(self):
        super(Vectorize, self).__init__()

    @property
    def data(self):
        # pulls items from the next class in the MRO
        for item in super(Vectorize, self).data:
            transformed = vectorize(item)  # slight simplification here
            yield transformed


class SelectKBest(object):
    """Select K best features based on some metric"""
    def __init__(self):
        super(SelectKBest, self).__init__()

    @property
    def data(self):
        for item in super(SelectKBest, self).data:
            transformed = select_kbest(item)  # slight simplification here
            yield transformed


class SelectPercentile(object):
    """Select the top X percentile features based on some metric"""
    def __init__(self):
        super(SelectPercentile, self).__init__()

    @property
    def data(self):
        for item in super(SelectPercentile, self).data:
            transformed = select_percentile(item)  # slight simplification here
            yield transformed


class ExperimentA(SelectKBest, Vectorize, TwitterCacheDB):
    # lots of control code goes here
    pass


class ExperimentB(SelectKBest, Vectorize, NewsCacheDB):
    # lots of control code goes here
    pass


class ExperimentC(SelectPercentile, Vectorize, NewsCacheDB):
    # lots of control code goes here
    pass
EXAMPLE 2
def multiply(fn):
    def wrapped(self):
        return fn(self) * 2
    return wrapped


def twitter_cacheDB(fn):
    def wrapped(self):
        user, password = fn(self)
        # set up access to the database
        while db.isalive():
            yield db.next()  # slight simplification here
    return wrapped


def twitter_live(fn):
    def wrapped(self):
        user, password = fn(self)
        # set up access to the live Twitter stream
        while stream.isalive():
            yield stream.next()  # slight simplification here
    return wrapped


def news_cacheDB(fn):
    def wrapped(self):
        user, password = fn(self)
        # set up access to the database
        while db.isalive():
            yield db.next()  # slight simplification here
    return wrapped


def vectorize(fn):
    def wrapped(self):
        for item in fn(self):
            transformed = do_vectorize(item)  # slight simplification here
            yield transformed
    return wrapped


def select_kbest(fn):
    def wrapped(self):
        for item in fn(self):
            transformed = do_selection(item)  # slight simplification here
            yield transformed
    return wrapped


class ExperimentA(object):
    @property
    @select_kbest
    @vectorize
    @twitter_cacheDB
    def a(self):
        return 'me', '123'  # return user and password to connect to DB


class ExperimentB(object):
    @property
    @select_kbest
    @vectorize
    @news_cacheDB
    def a(self):
        return 'me', '123'  # return user and password to connect to DB
Best Answer
Less code, as long as it's readable, is better than more code
From a code size point of view, I always go with the solution that requires the least code while still being readable and maintainable. Less code means fewer chances for defects and less code to maintain.
Multiple Inheritance is not a good choice for Composition
From a design standpoint, I would not use multiple inheritance the way you describe, for the following reasons:
- You are changing the way data behaves in the different classes. While the initial implementation doesn't directly violate the Open/Closed Principle of OO, any future change has a good chance of modifying behavior in one or more other locations. You are also relying on behavior pulled through super, which only works correctly if the base classes are ordered correctly in the class definition.
- Relying on the class definition to specify the correct ordering of classes creates a fragile system. It's fragile because you can't choose classes merely by the interfaces they define; you actually have to know the implemented logic so that the super calls get executed in the correct order. The result is also extremely tight coupling. Since it uses class inheritance, we also get vertical coupling, which basically means there are implicit dependencies not just within individual methods but potentially between the different layers (classes).
- Multiple inheritance in any language often has many pitfalls. Python does some work to fix some issues with inheritance; however, there are numerous ways of unintentionally confusing the method resolution order (MRO) of classes. These pitfalls always exist, and they are a prime reason to avoid using multiple inheritance.
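To make the ordering problem concrete, here is a minimal sketch. The classes Numbers, Double and AddOne are hypothetical stand-ins for a data source and two transformations; they are not part of the question's code. Swapping two base classes silently changes the result, and putting the source first silently drops a transformation.

```python
class Numbers(object):
    """Hypothetical data source (stands in for something like NewsCacheDB)."""
    @property
    def data(self):
        for item in (1, 2, 3):
            yield item


class Double(object):
    """Hypothetical transformation that relies on the next class in the MRO."""
    @property
    def data(self):
        for item in super(Double, self).data:
            yield item * 2


class AddOne(object):
    """Second hypothetical transformation, written the same way."""
    @property
    def data(self):
        for item in super(AddOne, self).data:
            yield item + 1


class DoubleThenAddOne(AddOne, Double, Numbers):
    pass  # MRO: AddOne -> Double -> Numbers, so doubling runs first


class AddOneThenDouble(Double, AddOne, Numbers):
    pass  # same bases, different order, different pipeline


class SilentlySkipped(Numbers, Double):
    pass  # Numbers.data never calls super, so Double is never reached


print(list(DoubleThenAddOne().data))   # [3, 5, 7]
print(list(AddOneThenDouble().data))   # [4, 6, 8]
print(list(SilentlySkipped().data))    # [1, 2, 3]
```

Nothing in the class headers tells the reader that the leftmost base runs last, and the third case raises no error at all.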
Alternatives
Alternatively, I would leave the data-source-specific logic in the classes (i.e. *_CacheDB), then use either decorators or functional composition to add the generalized logic that automatically applies the transformations.
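One way the functional-composition variant might look is sketched below. It rests on assumptions: the helpers mapped and pipeline are hypothetical names introduced here, NewsCacheDB takes an in-memory list instead of a real database connection, and vectorize and select_kbest are toy placeholders for the real transformations.

```python
class NewsCacheDB(object):
    """Data source only; it knows nothing about later transformations."""
    def __init__(self, articles):
        self._articles = articles  # assumption: stands in for database access

    @property
    def data(self):
        for article in self._articles:
            yield article


def mapped(func):
    """Lift a per-item function into a stage that transforms a whole stream."""
    def stage(items):
        for item in items:
            yield func(item)
    return stage


def pipeline(source, *stages):
    """Feed source.data through each stage, in the order the stages are listed."""
    items = source.data
    for stage in stages:
        items = stage(items)
    return items


def vectorize(text):
    # toy placeholder: one number (word length) per word
    return [len(word) for word in text.split()]


def select_kbest(vector, k=2):
    # toy placeholder: keep the k largest values
    return sorted(vector, reverse=True)[:k]


source = NewsCacheDB(["a bb ccc", "dddd e"])
experiment_b = pipeline(source, mapped(vectorize), mapped(select_kbest))
print(list(experiment_b))  # [[3, 2], [4, 1]]
```

The order of the transformations is spelled out at the call site rather than implied by the MRO, and each stage can be tested on a plain list without constructing an experiment class.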