Python – How to set a threshold for a sklearn classifier based on ROC results

classificationpythonrocscikit-learnthreshold

I trained an ExtraTreesClassifier (gini index) using scikit-learn and it suits my needs fairly. Not so good accuracy, but using a 10-fold cross validation, AUC is 0.95. I would like to use this classifier on my work. I am quite new to ML, so please forgive me if I'm asking you something conceptually wrong.

I plotted some ROC curves, and by it, its seems I have a specific threshold where my classifier starts performing well. I'd like to set this value on the fitted classifier, so everytime I'd call predict, the classifiers use that threshold and I could believe in the FP and TP rates.

I also came to this post (scikit .predict() default threshold), where its stated that a threshold is not a generic concept for classifiers. But since the ExtraTreesClassifier has the method predict_proba, and the ROC curve is also related to thresdholds definition, it seems to me I should be available to specify it.

I did not find any parameter, nor any class/interface to use to do it. How can I set a threshold for it for a trained ExtraTreesClassifier (or any other one) using scikit-learn?

Many Thanks,
Colis

Best Answer

This is what I have done:

model = SomeSklearnModel()
model.fit(X_train, y_train)
predict = model.predict(X_test)
predict_probabilities = model.predict_proba(X_test)
fpr, tpr, _ = roc_curve(y_test, predict_probabilities)

However, I am annoyed that predict chooses a threshold corresponding to 0.4% of true positives (false positives are zero). The ROC curve shows a threshold I like better for my problem where the true positives are approximately 20% (false positive around 4%). I then scan the predict_probabilities to find what probability value corresponds to my favourite ROC point. In my case this probability is 0.21. Then I create my own predict array:

predict_mine = np.where(rf_predict_probabilities > 0.21, 1, 0)

and there you go:

confusion_matrix(y_test, predict_mine)

returns what I wanted:

array([[6927,  309],
       [ 621,  121]])

Related Solutions

System call and context switch

You need to understand that a thread/process context has multiple parts, one, directly associated with execution and is held in the CPU and certain system tables in memory that the CPU uses (e.g. page tables), and the other, which is needed for the OS, for bookkeeping (think of the various IDs, handles, special OS-specific permissions, network connections and such).

A full context switch would involve swapping both of these, the old current thread/process goes away for a while and the new current thread/process comes in for a while. That's the essence of thread/process scheduling.

Now, system calls are very different w.r.t. each other.

Consider something simple, for example, the system call for requesting the current date and time. The CPU switches from the user to kernel mode, preserving the user-mode register values, executes some kernel code to get the necessary data, stores it either in the memory or registers that the caller can access, restores the user-mode register values and returns. There's not much of context switch in here, only what's needed for the transition between the modes, user and kernel.

Consider now a system call that involves blocking of the caller until some event or availability of data. Manipulating mutexes and reading files would be examples of such system calls. In this case the kernel is forced to save the full context of the caller, mark it as blocked so the scheduler can't run it until that event or data arrives, and load the context of another ready thread/process, so it can run.

That's how system calls are related to context switches.

Kernel executing in the context of a user or a process means that whenever the kernel does work on behalf of a certain process or user it has to take into consideration that user's/process's context, e.g. the current process/thread/user ID, the current directory, locale, access permissions for various resources (e.g. files), all that stuff, that can be different between different processes/threads/users.

If processes have individual address spaces, the address spaces is also part of the process context. So, when the kernel needs to access memory of a process (to read/write file data or network packets), it has to have access to the process' address space, IOW, it has to be in its context (it doesn't mean, however, that the kernel has to load the full context just to access memory in a specific address space).

Is that helpful?

Python – How to select rows from a DataFrame based on column values

To select rows whose column value equals a scalar, some_value, use ==:

df.loc[df['column_name'] == some_value]

To select rows whose column value is in an iterable, some_values, use isin:

df.loc[df['column_name'].isin(some_values)]

Combine multiple conditions with &:

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Note the parentheses. Due to Python's operator precedence rules, & binds more tightly than <= and >=. Thus, the parentheses in the last example are necessary. Without the parentheses

df['column_name'] >= A & df['column_name'] <= B

is parsed as

df['column_name'] >= (A & df['column_name']) <= B

which results in a Truth value of a Series is ambiguous error.

To select rows whose column value does not equal some_value, use !=:

df.loc[df['column_name'] != some_value]

isin returns a boolean Series, so to select rows whose value is not in some_values, negate the boolean Series using ~:

df.loc[~df['column_name'].isin(some_values)]

For example,

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
#      A      B  C   D
# 0  foo    one  0   0
# 1  bar    one  1   2
# 2  foo    two  2   4
# 3  bar  three  3   6
# 4  foo    two  4   8
# 5  bar    two  5  10
# 6  foo    one  6  12
# 7  foo  three  7  14

print(df.loc[df['A'] == 'foo'])

yields

     A      B  C   D
0  foo    one  0   0
2  foo    two  2   4
4  foo    two  4   8
6  foo    one  6  12
7  foo  three  7  14

If you have multiple values you want to include, put them in a list (or more generally, any iterable) and use isin:

print(df.loc[df['B'].isin(['one','three'])])

yields

     A      B  C   D
0  foo    one  0   0
1  bar    one  1   2
3  bar  three  3   6
6  foo    one  6  12
7  foo  three  7  14

Note, however, that if you wish to do this many times, it is more efficient to make an index first, and then use df.loc:

df = df.set_index(['B'])
print(df.loc['one'])

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
one  foo  6  12

or, to include multiple values from the index use df.index.isin:

df.loc[df.index.isin(['one','two'])]

yields

       A  C   D
B              
one  foo  0   0
one  bar  1   2
two  foo  2   4
two  foo  4   8
two  bar  5  10
one  foo  6  12

Best Answer

Related Solutions

System call and context switch

Python – How to select rows from a DataFrame based on column values

Related Topic