Python Pandas : pivot table with aggfunc = count unique distinct

pandaspivot tablepython

This code:

df2 = (
    pd.DataFrame({
        'X' : ['X1', 'X1', 'X1', 'X1'], 
        'Y' : ['Y2', 'Y1', 'Y1', 'Y1'], 
        'Z' : ['Z3', 'Z1', 'Z1', 'Z2']
    })
)
g = df2.groupby('X')
pd.pivot_table(g, values='X', rows='Y', cols='Z', margins=False, aggfunc='count')

returns the following error:

Traceback (most recent call last): ... 
AttributeError: 'Index' object has no attribute 'index'

How do I get a Pivot Table with counts of unique values of one DataFrame column for two other columns?
Is there aggfunc for count unique? Should I be using np.bincount()?

NB. I am aware of pandas.Series.values_counts() however I need a pivot table.

EDIT: The output should be:

Z   Z1  Z2  Z3
Y             
Y1   1   1 NaN
Y2 NaN NaN   1

Best Answer

Do you mean something like this?

>>> df2.pivot_table(values='X', index='Y', columns='Z', aggfunc=lambda x: len(x.unique()))

Z   Z1  Z2  Z3
Y             
Y1   1   1 NaN
Y2 NaN NaN   1

Note that using len assumes you don't have NAs in your DataFrame. You can do x.value_counts().count() or len(x.dropna().unique()) otherwise.

Related Solutions

Python – How to upgrade all Python packages with pip

There isn't a built-in flag yet, but you can use

pip list --outdated --format=freeze | grep -v '^\-e' | cut -d = -f 1  | xargs -n1 pip install -U

Note: there are infinite potential variations for this. I'm trying to keep this answer short and simple, but please do suggest variations in the comments!

In older version of pip, you can use this instead:

pip freeze --local | grep -v '^\-e' | cut -d = -f 1  | xargs -n1 pip install -U

The grep is to skip editable ("-e") package definitions, as suggested by @jawache. (Yes, you could replace grep+cut with sed or awk or perl or...).

The -n1 flag for xargs prevents stopping everything if updating one package fails (thanks @andsens).

Python – Adding new column to existing DataFrame in Python pandas

Edit 2017

As indicated in the comments and by @Alexander, currently the best method to add the values of a Series as a new column of a DataFrame could be using assign:

df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)

Edit 2015
Some reported getting the SettingWithCopyWarning with this code.
However, the code still runs perfectly with the current pandas version 0.16.1.

>>> sLength = len(df1['a'])
>>> df1
          a         b         c         d
6 -0.269221 -0.026476  0.997517  1.294385
8  0.917438  0.847941  0.034235 -0.448948

>>> df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e
6 -0.269221 -0.026476  0.997517  1.294385  1.757167
8  0.917438  0.847941  0.034235 -0.448948  2.228131

>>> pd.version.short_version
'0.16.1'

The SettingWithCopyWarning aims to inform of a possibly invalid assignment on a copy of the Dataframe. It doesn't necessarily say you did it wrong (it can trigger false positives) but from 0.13.0 it let you know there are more adequate methods for the same purpose. Then, if you get the warning, just follow its advise: Try using .loc[row_index,col_indexer] = value instead

>>> df1.loc[:,'f'] = pd.Series(np.random.randn(sLength), index=df1.index)
>>> df1
          a         b         c         d         e         f
6 -0.269221 -0.026476  0.997517  1.294385  1.757167 -0.050927
8  0.917438  0.847941  0.034235 -0.448948  2.228131  0.006109
>>>

In fact, this is currently the more efficient method as described in pandas docs

Original answer:

Use the original df1 indexes to create the series:

df1['e'] = pd.Series(np.random.randn(sLength), index=df1.index)

Best Answer

Related Solutions

Python – How to upgrade all Python packages with pip

Python – Adding new column to existing DataFrame in Python pandas

Related Topic