Python – How to count outliers for all columns in Python

pandas, python

I have a dataset with three columns in a Python notebook. There seem to be a lot of outliers beyond 1.5 times the IQR. How can I count the outliers in every column?

If there are too many outliers, I may remove the points that are outliers for more than one feature. If so, how can I count them that way?

Thanks!

Best Answer

Similar to Romain X.'s answer but operates on the DataFrame instead of Series.

Random data:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list('ABCDE'))
df.iloc[::10] += np.random.randn() * 2  # this hopefully introduces some outliers
df.head()
Out: 
          A         B         C         D         E
0  2.529517  1.165622  1.744203  3.006358  2.633023
1 -0.977278  0.950088 -0.151357 -0.103219  0.410599
2  0.144044  1.454274  0.761038  0.121675  0.443863
3  0.333674  1.494079 -0.205158  0.313068 -0.854096
4 -2.552990  0.653619  0.864436 -0.742165  2.269755

Quartile calculations:

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

And these are the outlier counts for each column:

((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()
Out: 
A    1
B    0
C    0
D    1
E    2
dtype: int64
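The same calculation can be wrapped in a small helper. This is only a sketch: the function name `count_iqr_outliers`, the `k` parameter, and the `demo` data are made up for illustration, and restricting the check to numeric columns is an assumption about what you want for mixed-type data.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each numeric column."""
    numeric = frame.select_dtypes(include="number")  # skip text/date columns
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # The per-column thresholds (Series) are aligned with the DataFrame's columns.
    mask = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return mask.sum()


# Hand-made example: 100 is far away from the other values in "x".
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 100],
    "y": [10, 11, 12, 13, 14],
    "label": list("abcde"),
})
counts = count_iqr_outliers(demo)
print(counts)
```

Here "x" has quartiles 2 and 4, so the fences are -1 and 7 and only 100 falls outside them; "y" has no outliers, and "label" is skipped because it is not numeric.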

These counts are in line with the outliers that seaborn's boxplot draws for the same data.

Note that the part before the sum ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))) is a boolean mask so you can use it directly to remove outliers. This sets them to NaN, for example:

mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
df[mask] = np.nan
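For the second part of the question (points that are outliers for more than one feature), one option is to sum the mask across columns instead of down them. The snippet below is a sketch that rebuilds the random data so it runs on its own; the names `outliers_per_row`, `multi_outlier_rows`, and `df_clean` are just illustrative, and the "more than one" threshold comes from the question rather than from any rule.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer, rebuilt so this snippet is self-contained.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(100, 5), columns=list("ABCDE"))
df.iloc[::10] += np.random.randn() * 2

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))

# Number of columns in which each row is flagged as an outlier.
outliers_per_row = mask.sum(axis=1)

# Rows flagged in more than one column (adjust the threshold as needed).
multi_outlier_rows = outliers_per_row > 1
print("rows flagged in more than one column:", multi_outlier_rows.sum())

# Drop those rows entirely instead of overwriting single cells with NaN.
df_clean = df[~multi_outlier_rows]
```

Using `mask.any(axis=1)` instead would select rows that are an outlier in at least one column.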

Related Solutions

Python – How to execute a program or call a system command

Use the subprocess module in the standard library:

import subprocess
subprocess.run(["ls", "-l"])

The advantage of subprocess.run over os.system is that it is more flexible: you can capture stdout and stderr, get the "real" status code, and handle errors better.
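As a rough illustration of that flexibility, the sketch below captures the output and exit status of a child process. It launches the current Python interpreter rather than `ls` so that it does not depend on the platform; the command itself is only a placeholder.

```python
import subprocess
import sys

# Run a child Python process so the example does not rely on `ls` being available.
result = subprocess.run(
    [sys.executable, "-c", "print('hello from the child process')"],
    capture_output=True,  # collect stdout and stderr (Python 3.7+)
    text=True,            # decode the captured bytes to str
    check=True,           # raise CalledProcessError on a non-zero exit status
)

print(result.returncode)      # exit status of the child
print(result.stdout.strip())  # text the child wrote to stdout
print(repr(result.stderr))    # text the child wrote to stderr, if any
```

With `check=True`, a failing command raises an exception instead of returning silently, which is one of the error-handling improvements over os.system.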

Even the documentation for os.system recommends using subprocess instead:

The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.

On Python 3.4 and earlier, use subprocess.call instead of .run:

subprocess.call(["ls", "-l"])

Python – How to get the current time in Python

Use:

>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:

>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.

To save typing, you can import the datetime class from the datetime module:

>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
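With that import, the calls above might look like the following sketch. The `strftime` line is an extra that is not part of the original answer, and the variable names are only illustrative.

```python
from datetime import datetime

now = datetime.now()       # same as datetime.datetime.now() with the plain import
current_time = now.time()  # just the time-of-day part

print(now)
print(current_time)
# strftime controls the output format, here hours:minutes:seconds.
print(now.strftime("%H:%M:%S"))
```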
