Python – How to find which attributes the tree splits on, when using scikit-learn

decision-treemachine learningpythonscikit-learn

I have been exploring scikit-learn, making decision trees with both entropy and gini splitting criteria, and exploring the differences.

My question, is how can I "open the hood" and find out exactly which attributes the trees are splitting on at each level, along with their associated information values, so I can see where the two criterion make different choices?

So far, I have explored the 9 methods outlined in the documentation. They don't appear to allow access to this information. But surely this information is accessible? I'm envisioning a list or dict that has entries for node and gain.

Thanks for your help and my apologies if I've missed something completely obvious.

Best Answer

Directly from the documentation ( http://scikit-learn.org/0.12/modules/tree.html ):

from io import StringIO
out = StringIO()
out = tree.export_graphviz(clf, out_file=out)

StringIO module is no longer supported in Python3, instead import io module.

There is also the tree_ attribute in your decision tree object, which allows the direct access to the whole structure.

And you can simply read it

clf.tree_.children_left #array of left children
clf.tree_.children_right #array of right children
clf.tree_.feature #array of nodes splitting feature
clf.tree_.threshold #array of nodes splitting points
clf.tree_.value #array of nodes values

for more details look at the source code of export method

In general you can use the inspect module

from inspect import getmembers
print( getmembers( clf.tree_ ) )

to get all the object's elements

Decision tree visualization from sklearn docs

Related Solutions

Python – How to obtain information gain from a scikit-learn DecisionTreeClassifier

You can only access the information gain (or gini impurity) for a feature that has been used as a split node. The attribute DecisionTreeClassifier.tree_.best_error[i] holds the entropy of the i-th node splitting on feature DecisionTreeClassifier.tree_.feature[i]. If you want the entropy of all examples that reach the i-th node look at DecisionTreeClassifier.tree_.init_error[i].

For more information see the documentation here: https://github.com/scikit-learn/scikit-learn/blob/dacfd8bd5d943cb899ed8cd423aaf11b4f27c186/sklearn/tree/_tree.pyx#L64

If you want to access the entropy for each feature (at a certain split node) - you need to modify the function find_best_split https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L713

Python – How to extract the decision rules from scikit-learn decision-tree

I believe that this answer is more correct than the other answers here:

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print "def tree({}):".format(", ".join(feature_names))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print "{}if {} <= {}:".format(indent, name, threshold)
            recurse(tree_.children_left[node], depth + 1)
            print "{}else:  # if {} > {}".format(indent, name, threshold)
            recurse(tree_.children_right[node], depth + 1)
        else:
            print "{}return {}".format(indent, tree_.value[node])

    recurse(0, 1)

This prints out a valid Python function. Here's an example output for a tree that is trying to return its input, a number between 0 and 10.

def tree(f0):
  if f0 <= 6.0:
    if f0 <= 1.5:
      return [[ 0.]]
    else:  # if f0 > 1.5
      if f0 <= 4.5:
        if f0 <= 3.5:
          return [[ 3.]]
        else:  # if f0 > 3.5
          return [[ 4.]]
      else:  # if f0 > 4.5
        return [[ 5.]]
  else:  # if f0 > 6.0
    if f0 <= 8.5:
      if f0 <= 7.5:
        return [[ 7.]]
      else:  # if f0 > 7.5
        return [[ 8.]]
    else:  # if f0 > 8.5
      return [[ 9.]]

Here are some stumbling blocks that I see in other answers:

Using tree_.threshold == -2 to decide whether a node is a leaf isn't a good idea. What if it's a real decision node with a threshold of -2? Instead, you should look at tree.feature or tree.children_*.
The line features = [feature_names[i] for i in tree_.feature] crashes with my version of sklearn, because some values of tree.tree_.feature are -2 (specifically for leaf nodes).
There is no need to have multiple if statements in the recursive function, just one is fine.

Best Answer

Related Solutions

Python – How to obtain information gain from a scikit-learn DecisionTreeClassifier

Python – How to extract the decision rules from scikit-learn decision-tree

Related Topic