Python – How to do Pearson correlation of selected columns of a Pandas data frame

pandaspython

I have a CSV that looks like this:

gene,stem1,stem2,stem3,b1,b2,b3,special_col
foo,20,10,11,23,22,79,3
bar,17,13,505,12,13,88,1
qui,17,13,5,12,13,88,3

And as data frame it looks like this:

In [17]: import pandas as pd
In [20]: df = pd.read_table("http://dpaste.com/3PQV3FA.txt",sep=",")
In [21]: df
Out[21]:
  gene  stem1  stem2  stem3  b1  b2  b3  special_col
0  foo     20     10     11  23  22  79            3
1  bar     17     13    505  12  13  88            1
2  qui     17     13      5  12  13  88            3

What I want to do is to perform pearson correlation from last column (special_col) with every columns between gene column and special column, i.e. colnames[1:number_of_column-1]

At the end of the day we will have length 6 data frame.

Coln   PearCorr
stem1  0.5
stem2 -0.5
stem3 -0.9999453506011533
b1    0.5
b2    0.5
b3    -0.5

The above value is computed manually:

In [27]: import scipy.stats
In [39]: scipy.stats.pearsonr([3, 1, 3], [11,505,5])
Out[39]: (-0.9999453506011533, 0.0066556395400007278)

How can I do that?

Best Answer

Note there is a mistake in your data, there special col is all 3, so no correlation can be computed.

If you remove the column selection in the end you'll get a correlation matrix of all other columns you are analysing. The last [:-1] is to remove correlation of 'special_col' with itself.

In [15]: data[data.columns[1:]].corr()['special_col'][:-1]
Out[15]: 
stem1    0.500000
stem2   -0.500000
stem3   -0.999945
b1       0.500000
b2       0.500000
b3      -0.500000
Name: special_col, dtype: float64

If you are interested in speed, this is slightly faster on my machine:

In [33]: np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
Out[33]: 
array([ 0.5       , -0.5       , -0.99994535,  0.5       ,  0.5       ,
       -0.5       ])

In [34]: %timeit np.corrcoef(data[data.columns[1:]].T)[-1][:-1]
1000 loops, best of 3: 437 µs per loop

In [35]: %timeit data[data.columns[1:]].corr()['special_col']
1000 loops, best of 3: 526 µs per loop

But obviously, it returns an array, not a pandas series/DF.