Python – How to obtain information gain from a scikit-learn DecisionTreeClassifier

classification, machine-learning, python, scikit-learn

I see that DecisionTreeClassifier accepts criterion='entropy', which means that it must use information gain as the criterion for splitting the decision tree.
What I need is the information gain for each feature at the root level, at the moment it is about to split the root node.

Best Answer

You can only access the information gain (or Gini impurity) for a feature that has actually been used as a split node. The attribute DecisionTreeClassifier.tree_.best_error[i] holds the entropy of the i-th node split on feature DecisionTreeClassifier.tree_.feature[i]. If you want the entropy of all examples that reach the i-th node, look at DecisionTreeClassifier.tree_.init_error[i]. (Note that these attributes belong to the internal Cython tree of the scikit-learn version pinned in the source link below; newer releases restructure these internals.)
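In recent scikit-learn versions those internals look different: the fitted tree exposes arrays such as tree_.impurity (per-node entropy when criterion='entropy'), tree_.feature, tree_.children_left/children_right, and tree_.weighted_n_node_samples. A sketch of recovering the information gain of the root split from these public arrays, using iris purely as an example dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a tree with the entropy criterion (iris is just an example dataset)
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

tree = clf.tree_
root = 0
left, right = tree.children_left[root], tree.children_right[root]

# Information gain of the root split: parent entropy minus the
# sample-weighted entropies of the two child nodes.
n = tree.weighted_n_node_samples
gain = tree.impurity[root] - (
    n[left] / n[root] * tree.impurity[left]
    + n[right] / n[root] * tree.impurity[right]
)
print("root splits on feature", tree.feature[root], "with gain", gain)
```

This only gives the gain for the feature the tree actually chose at the root, not for every candidate feature.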

For more information see the documentation here: https://github.com/scikit-learn/scikit-learn/blob/dacfd8bd5d943cb899ed8cd423aaf11b4f27c186/sklearn/tree/_tree.pyx#L64

If you want the entropy for each feature (at a given split node), you need to modify the function find_best_split: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L713
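Alternatively, since the root node sees the full training set, the per-feature information gain can be computed directly in Python without touching the Cython internals: for each feature, try the midpoints between consecutive sorted values as candidate thresholds (the same axis-aligned binary splits the tree considers) and keep the best gain. A sketch, again using iris as an example dataset:

```python
import numpy as np
from sklearn.datasets import load_iris

def entropy(y):
    # Shannon entropy (base 2) of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_gain_for_feature(x, y):
    # Candidate thresholds: midpoints between consecutive unique values,
    # mirroring the binary "x <= t" splits a decision tree considers.
    parent = entropy(y)
    values = np.unique(x)
    best = 0.0
    for t in (values[:-1] + values[1:]) / 2:
        mask = x <= t
        n_left = mask.sum()
        gain = parent - (
            n_left / len(y) * entropy(y[mask])
            + (len(y) - n_left) / len(y) * entropy(y[~mask])
        )
        best = max(best, gain)
    return best

X, y = load_iris(return_X_y=True)
for i in range(X.shape[1]):
    print(f"feature {i}: best information gain {best_gain_for_feature(X[:, i], y):.3f}")
```

The feature with the largest value here is the one an entropy-criterion tree would pick at the root.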