I have a dataset with three columns in a Python notebook. There seem to be a lot of outliers beyond 1.5 times the IQR. How can I count the outliers for all columns?
If there are too many outliers, I may consider removing only the points that are outliers in more than one feature. If so, how can I count them that way?
Thanks!
Best Answer
Similar to Romain X.'s answer but operates on the DataFrame instead of Series.
Random data:
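A minimal sketch of such data, assuming pandas and NumPy are available. The seed, the 100 rows, and the column names A, B, C are illustrative choices, not taken from the question:

```python
import numpy as np
import pandas as pd

# Assumed setup: 100 rows of standard-normal noise in three columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))

print(df.head())
```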
Quartile calculations:
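One way to get the quartiles for every column at once is DataFrame.quantile, which returns a Series indexed by column name. This is a sketch on assumed random data (seed and column names are illustrative):

```python
import numpy as np
import pandas as pd

# Assumed example data: three numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))

Q1 = df.quantile(0.25)  # first quartile of each column
Q3 = df.quantile(0.75)  # third quartile of each column
IQR = Q3 - Q1           # interquartile range of each column

print(IQR)
```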
And these are the numbers for each column:
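A sketch of the count, again on assumed random data. Comparing the DataFrame with the per-column fences gives a boolean DataFrame, and summing it counts the True values in each column:

```python
import numpy as np
import pandas as pd

# Assumed example data: three numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# The Series of fences is aligned with the DataFrame's columns,
# so each column is compared against its own lower and upper fence.
outlier_counts = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()

print(outlier_counts)
```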
In line with seaborn's calculations:
Note that the part before the sum, namely (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)), is a boolean mask, so you can use it directly to remove outliers. This sets them to NaN, for example:
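A sketch of using the mask, on assumed random data. DataFrame.mask is one way to blank out the flagged cells. Summing the mask across columns also gives, per row, the number of features in which that row is an outlier, which covers the "more than one feature" case from the question. The planted first row is purely illustrative, so that at least one row is flagged in several columns:

```python
import numpy as np
import pandas as pd

# Assumed example data: three numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))
df.loc[0] = [10.0, 10.0, 10.0]  # illustrative row that is extreme in every column

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

mask = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))

# Replace every flagged cell with NaN, leaving the other cells untouched.
df_nan = df.mask(mask)

# Number of columns in which each row is an outlier.
flags_per_row = mask.sum(axis=1)

# Rows flagged in more than one feature: count them, then drop them.
multi = flags_per_row > 1
n_multi = int(multi.sum())
df_filtered = df[~multi]

print(n_multi, len(df_filtered))
```

Changing the threshold in flags_per_row > 1 controls how strict the row filter is; > 0 would drop every row that is an outlier in any column.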