Python – pandas GroupBy columns with NaN (missing) values

group-bynanpandaspandas-groupbypython

I have a DataFrame with many missing values in columns which I wish to groupby:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})

In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}

see that Pandas has dropped the rows with NaN target values. (I want to include these rows!)

Since I need many such operations (many cols have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing too complicated pieces of code.

Any suggestions? Should I write a function for this or is there a simple solution?

Best Answer

pandas >= 1.1

From pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False:

pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'

# Example from the docs
df

   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# without NA (the default)
df.groupby('b').sum()

     a  c
b        
1.0  2  3
2.0  2  5

# with NA
df.groupby('b', dropna=False).sum()

     a  c
b        
1.0  2  3
2.0  2  5
NaN  1  4

Related Solutions

Python – How to check for NaN values

math.isnan(x)

Return True if x is a NaN (not a number), and False otherwise.

>>> import math
>>> x = float('nan')
>>> math.isnan(x)
True

Python – Converting a Pandas GroupBy output from Series to DataFrame

g1 here is a DataFrame. It has a hierarchical index, though:

In [19]: type(g1)
Out[19]: pandas.core.frame.DataFrame

In [20]: g1.index
Out[20]: 
MultiIndex([('Alice', 'Seattle'), ('Bob', 'Seattle'), ('Mallory', 'Portland'),
       ('Mallory', 'Seattle')], dtype=object)

Perhaps you want something like this?

In [21]: g1.add_suffix('_Count').reset_index()
Out[21]: 
      Name      City  City_Count  Name_Count
0    Alice   Seattle           1           1
1      Bob   Seattle           2           2
2  Mallory  Portland           2           2
3  Mallory   Seattle           1           1

Or something like:

In [36]: DataFrame({'count' : df1.groupby( [ "Name", "City"] ).size()}).reset_index()
Out[36]: 
      Name      City  count
0    Alice   Seattle      1
1      Bob   Seattle      2
2  Mallory  Portland      2
3  Mallory   Seattle      1

Best Answer

pandas >= 1.1

Related Solutions

Python – How to check for NaN values

Python – Converting a Pandas GroupBy output from Series to DataFrame

Related Topic