SciPy Spearman correlation coefficient is NaN in some cases

Time: 2017-11-29 22:17:55

Tags: python python-3.x pandas scipy correlation

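I am calculating the Spearman correlation coefficient for each interviewer. It works for Interviewer_1, but I don't understand why SciPy fails for Interviewer_2 and gives no correlation / 0 / NaN.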

import pandas as pd
from pandas import DataFrame
import scipy.stats


df = pd.DataFrame({'Interviewer': ['Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_1','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2','Interviewer_2'],
                    'Score_1': [-1,-1,-1,1,1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,1,-1],
                    'Score_2': [1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1]
                    })

df

The sample data looks like this:

    Interviewer Score_1 Score_2
0   Interviewer_1   -1  1
1   Interviewer_1   -1  -1
2   Interviewer_1   -1  -1
3   Interviewer_1   1   -1
4   Interviewer_1   1   1
5   Interviewer_1   -1  1
6   Interviewer_1   -1  -1
7   Interviewer_1   -1  -1
8   Interviewer_1   1   -1
9   Interviewer_1   1   -1
10  Interviewer_2   -1  -1
11  Interviewer_2   -1  -1
12  Interviewer_2   -1  -1
13  Interviewer_2   -1  -1
14  Interviewer_2   -1  -1
15  Interviewer_2   -1  -1
16  Interviewer_2   -1  -1
17  Interviewer_2   -1  -1
18  Interviewer_2   -1  -1
19  Interviewer_2   -1  -1
20  Interviewer_2   -1  -1
21  Interviewer_2   -1  -1
22  Interviewer_2   -1  -1
23  Interviewer_2   1   -1
24  Interviewer_2   -1  -1
25  Interviewer_2   -1  -1
26  Interviewer_2   -1  -1
27  Interviewer_2   -1  -1
28  Interviewer_2   1   -1
29  Interviewer_2   -1  -1

df.groupby('Interviewer').sum()

This gives the sums:

           Score_1  Score_2
Interviewer     
Interviewer_1   -2  -4
Interviewer_2   -16 -20

Using SciPy:

def applyspearman(row):
    row['Cor'] = scipy.stats.spearmanr(row['Score_1'], row['Score_2'])[0]
    return row

df = df.groupby('Interviewer').apply(applyspearman)

df
    Interviewer Score_1 Score_2 Cor
0   Interviewer_1   -1  1   -0.089087081
1   Interviewer_1   -1  -1  -0.089087081
2   Interviewer_1   -1  -1  -0.089087081
3   Interviewer_1   1   -1  -0.089087081
4   Interviewer_1   1   1   -0.089087081
5   Interviewer_1   -1  1   -0.089087081
6   Interviewer_1   -1  -1  -0.089087081
7   Interviewer_1   -1  -1  -0.089087081
8   Interviewer_1   1   -1  -0.089087081
9   Interviewer_1   1   -1  -0.089087081
10  Interviewer_2   -1  -1  
11  Interviewer_2   -1  -1  
12  Interviewer_2   -1  -1  
13  Interviewer_2   -1  -1  
14  Interviewer_2   -1  -1  
15  Interviewer_2   -1  -1  
16  Interviewer_2   -1  -1  
17  Interviewer_2   -1  -1  
18  Interviewer_2   -1  -1  
19  Interviewer_2   -1  -1  
20  Interviewer_2   -1  -1  
21  Interviewer_2   -1  -1  
22  Interviewer_2   -1  -1  
23  Interviewer_2   1   -1  
24  Interviewer_2   -1  -1  
25  Interviewer_2   -1  -1  
26  Interviewer_2   -1  -1  
27  Interviewer_2   -1  -1  
28  Interviewer_2   1   -1  
29  Interviewer_2   -1  -1
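
For reference, the same computation can be written so that it returns one coefficient per interviewer rather than a repeated column. This is only a sketch: it rebuilds the data above in a compact form and loops over the groups instead of using `apply`. The name `cor` is just for illustration.

```python
import pandas as pd
from scipy import stats

# Same data as in the question, written compactly.
df = pd.DataFrame({
    'Interviewer': ['Interviewer_1'] * 10 + ['Interviewer_2'] * 20,
    'Score_1': [-1, -1, -1, 1, 1, -1, -1, -1, 1, 1]
               + [-1] * 13 + [1] + [-1] * 4 + [1] + [-1],
    'Score_2': [1, -1, -1, -1, 1, 1, -1, -1, -1, -1] + [-1] * 20,
})

# spearmanr returns (correlation, pvalue); index 0 is the correlation.
cor = pd.Series({
    name: stats.spearmanr(group['Score_1'], group['Score_2'])[0]
    for name, group in df.groupby('Interviewer')
})
print(cor)
```

SciPy may also emit a warning about constant input for Interviewer_2, depending on the installed version.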

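I tried applying this formula manually in Excel (rank function, absolute difference d, d^2, and the sum of d^2) and got different results for both interviewers: ρ = 1 - (6 Σ d_i^2) / (n (n^2 - 1))

Interviewer_1: ρ = 0.878788

Interviewer_2: ρ = 0.993985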
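
One possible reason for the mismatch: as far as I know, the rank-difference formula above is exact only when there are no tied ranks. With ties, Spearman's coefficient is usually defined as the Pearson correlation of the average ranks. The sketch below compares the two on a tie-free example and on Interviewer_1's scores. The function names `rho_shortcut` and `rho_pearson_on_ranks` are hypothetical, and both use pandas' default average ranks, which may differ from what Excel's rank function produces.

```python
import pandas as pd

def rho_shortcut(x, y):
    """Rank-difference formula 1 - 6*sum(d^2) / (n*(n^2 - 1)) on average ranks."""
    d = x.rank() - y.rank()
    n = len(x)
    return 1.0 - 6.0 * float((d ** 2).sum()) / (n * (n ** 2 - 1))

def rho_pearson_on_ranks(x, y):
    """Pearson correlation of the average ranks (tie-aware definition)."""
    return x.rank().corr(y.rank())

# No ties: both approaches give the same value.
a = pd.Series([3, 1, 4, 2, 5])
b = pd.Series([2, 1, 5, 3, 4])
print(rho_shortcut(a, b), rho_pearson_on_ranks(a, b))

# Interviewer_1's scores from the question: heavily tied, so the two disagree.
s1 = pd.Series([-1, -1, -1, 1, 1, -1, -1, -1, 1, 1])
s2 = pd.Series([1, -1, -1, -1, 1, 1, -1, -1, -1, -1])
print(rho_shortcut(s1, s2), rho_pearson_on_ranks(s1, s2))
```

On the tied data, the Pearson-on-ranks value matches the -0.089087 that SciPy reported for Interviewer_1, which suggests SciPy uses the tie-aware definition.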

Questions

  1. Why is Interviewer_2 empty? Is the NaN problem related to ties in the ranking?
  2. Why does SciPy's result differ from mine?

1 answer:

Answer 0: (score: 1)

I'm not sure what exactly happens in the source, but you can define your own function using pandas' Series.rank(method='dense'), which seems to do the trick:

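import numpy as np

def spearmanr(x, y):
    """ `x`, `y` --> pd.Series"""
    assert x.shape == y.shape
    # Dense ranks: tied values share a rank and no rank is skipped
    rx = x.rank(method='dense')
    ry = y.rank(method='dense')
    d = rx - ry
    dsq = np.sum(np.square(d))
    n = x.shape[0]
    # Rank-difference formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    coef = 1. - (6. * dsq) / (n * (n**2 - 1.))
    return coef

# `grouped` is not defined in the original answer; presumably the per-interviewer groupby
grouped = df.groupby('Interviewer')
grouped.apply(lambda frame: spearmanr(frame['Score_1'], frame['Score_2']))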
Interviewer_1    0.970
Interviewer_2    0.998
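
A likely explanation for the NaN itself: every Score_2 value for Interviewer_2 is -1, so all of its ranks are identical and their standard deviation is zero. A correlation computed as covariance divided by the product of the standard deviations then becomes 0/0. The sketch below illustrates this with Interviewer_2's scores. `spearman_or_nan` is a hypothetical helper, and it assumes average ranks rather than the dense ranks used above.

```python
import numpy as np
import pandas as pd

def spearman_or_nan(x, y):
    """Pearson correlation of average ranks; NaN if either input is constant."""
    if x.nunique() < 2 or y.nunique() < 2:
        return np.nan  # zero variance in the ranks: correlation is undefined
    return x.rank(method='average').corr(y.rank(method='average'))

# Interviewer_2's scores from the question.
score_1 = pd.Series([-1] * 13 + [1] + [-1] * 4 + [1] + [-1])
score_2 = pd.Series([-1] * 20)

# Spread of the ranks of the constant column.
print(score_2.rank().std())
print(spearman_or_nan(score_1, score_2))
```

The dense-rank function above still returns a number here because the rank-difference formula never divides by the rank variance, so it may be worth treating a constant column as "no correlation available" rather than relying on that value.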