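I am calculating the Spearman correlation coefficient for each interviewer. It works for Interviewer_1, but I don't understand why SciPy breaks down for Interviewer_2 and gives no correlation (blank / 0 / NaN).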
import pandas as pd
from pandas import DataFrame
import scipy.stats
df = pd.DataFrame({'Interviewer': ['Interviewer_1'] * 10 + ['Interviewer_2'] * 20,
'Score_1': [-1,-1,-1,1,1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,1,-1],
'Score_2': [1,-1,-1,-1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1]
})
df
The sample data looks like this:
      Interviewer  Score_1  Score_2
0   Interviewer_1       -1        1
1   Interviewer_1       -1       -1
2   Interviewer_1       -1       -1
3   Interviewer_1        1       -1
4   Interviewer_1        1        1
5   Interviewer_1       -1        1
6   Interviewer_1       -1       -1
7   Interviewer_1       -1       -1
8   Interviewer_1        1       -1
9   Interviewer_1        1       -1
10  Interviewer_2       -1       -1
11  Interviewer_2       -1       -1
12  Interviewer_2       -1       -1
13  Interviewer_2       -1       -1
14  Interviewer_2       -1       -1
15  Interviewer_2       -1       -1
16  Interviewer_2       -1       -1
17  Interviewer_2       -1       -1
18  Interviewer_2       -1       -1
19  Interviewer_2       -1       -1
20  Interviewer_2       -1       -1
21  Interviewer_2       -1       -1
22  Interviewer_2       -1       -1
23  Interviewer_2        1       -1
24  Interviewer_2       -1       -1
25  Interviewer_2       -1       -1
26  Interviewer_2       -1       -1
27  Interviewer_2       -1       -1
28  Interviewer_2        1       -1
29  Interviewer_2       -1       -1
df.groupby('Interviewer').sum()
This produces the sums:
               Score_1  Score_2
Interviewer
Interviewer_1       -2       -4
Interviewer_2      -16      -20
Using SciPy:
def applyspearman(row):
    # Despite the name, `row` is the sub-DataFrame for one interviewer
    row['Cor'] = scipy.stats.spearmanr(row['Score_1'], row['Score_2'])[0]
    return row

df = df.groupby('Interviewer').apply(applyspearman)
df
      Interviewer  Score_1  Score_2           Cor
0   Interviewer_1       -1        1  -0.089087081
1   Interviewer_1       -1       -1  -0.089087081
2   Interviewer_1       -1       -1  -0.089087081
3   Interviewer_1        1       -1  -0.089087081
4   Interviewer_1        1        1  -0.089087081
5   Interviewer_1       -1        1  -0.089087081
6   Interviewer_1       -1       -1  -0.089087081
7   Interviewer_1       -1       -1  -0.089087081
8   Interviewer_1        1       -1  -0.089087081
9   Interviewer_1        1       -1  -0.089087081
10  Interviewer_2       -1       -1
11  Interviewer_2       -1       -1
12  Interviewer_2       -1       -1
13  Interviewer_2       -1       -1
14  Interviewer_2       -1       -1
15  Interviewer_2       -1       -1
16  Interviewer_2       -1       -1
17  Interviewer_2       -1       -1
18  Interviewer_2       -1       -1
19  Interviewer_2       -1       -1
20  Interviewer_2       -1       -1
21  Interviewer_2       -1       -1
22  Interviewer_2       -1       -1
23  Interviewer_2        1       -1
24  Interviewer_2       -1       -1
25  Interviewer_2       -1       -1
26  Interviewer_2       -1       -1
27  Interviewer_2       -1       -1
28  Interviewer_2        1       -1
29  Interviewer_2       -1       -1
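I tried applying this formula by hand in Excel (rank function, absolute differences, d^2 and the sum of d^2) and got different results for both interviewers: p = 1 - (6 Σ d_i^2) / (n (n^2 - 1))
Interviewer_1: p = 0.878788
Interviewer_2: p = 0.993985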
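A note on the formula above: as far as I know, the rank-difference shortcut is exact only when there are no tied ranks, and these scores are almost all ties. The tie-aware definition is the Pearson correlation of the average ranks. Below is a minimal pure-Python sketch of that definition, using Interviewer_1's scores copied from the sample data; the helper names (`average_ranks`, `pearson`, `spearman_with_ties`) are made up for illustration and are not part of SciPy or pandas.

```python
import math

def average_ranks(values):
    """Rank values ascending; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend `end` over the run of values equal to the one at `start`
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(a, b):
    """Plain Pearson correlation; returns nan if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return math.nan
    return cov / math.sqrt(var_a * var_b)

def spearman_with_ties(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Interviewer_1 scores, copied from the sample data
score_1 = [-1, -1, -1, 1, 1, -1, -1, -1, 1, 1]
score_2 = [1, -1, -1, -1, 1, 1, -1, -1, -1, -1]
rho_1 = spearman_with_ties(score_1, score_2)
print(round(rho_1, 6))  # -0.089087
```

This agrees with the Cor value SciPy reports for Interviewer_1 in the output above, which suggests the gap between SciPy and the Excel numbers comes from how ties are handled rather than from a SciPy bug.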
Question:
Answer 0 (score: 1)
Not sure what exactly is happening in the source, but you can define your own function using pandas' Series.rank(method='dense').
This seems to do the trick:
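import numpy as np

def spearmanr(x, y):
    """ `x`, `y` --> pd.Series"""
    assert x.shape == y.shape
    # Dense ranks: tied values share a rank and ranks have no gaps
    rx = x.rank(method='dense')
    ry = y.rank(method='dense')
    d = rx - ry
    dsq = np.sum(np.square(d))
    n = x.shape[0]
    coef = 1. - (6. * dsq) / (n * (n**2 - 1.))
    return coef

# `grouped` was not defined in the answer; assumed here to be the
# original sample data grouped by interviewer
grouped = df.groupby('Interviewer')
grouped.apply(lambda frame: spearmanr(frame['Score_1'], frame['Score_2']))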
Interviewer_1 0.970
Interviewer_2 0.998
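A possible explanation for the original NaN, offered as an observation from the sample data rather than something the answer above establishes: Score_2 is -1 in every Interviewer_2 row, so its ranks have zero variance and a correlation coefficient is undefined for that group. The sketch below rebuilds the sample data and computes the tie-aware coefficient per group as the Pearson correlation of average ranks, returning NaN explicitly when a column is constant. The helper name `spearman_or_nan` is hypothetical.

```python
import numpy as np
import pandas as pd

# Same values as the sample data in the question, written compactly
df = pd.DataFrame({
    'Interviewer': ['Interviewer_1'] * 10 + ['Interviewer_2'] * 20,
    'Score_1': [-1, -1, -1, 1, 1, -1, -1, -1, 1, 1]
               + [-1] * 13 + [1] + [-1] * 4 + [1, -1],
    'Score_2': [1, -1, -1, -1, 1, 1, -1, -1, -1, -1] + [-1] * 20,
})

def spearman_or_nan(frame):
    """Pearson correlation of average ranks; NaN if either column is constant."""
    x, y = frame['Score_1'], frame['Score_2']
    if x.nunique() < 2 or y.nunique() < 2:
        return np.nan  # ranks have zero variance, so rho is undefined
    return x.rank(method='average').corr(y.rank(method='average'))

# Count distinct values per interviewer: Score_2 has one value for Interviewer_2
distinct = df.groupby('Interviewer')[['Score_1', 'Score_2']].nunique()
print(distinct)

# Select the score columns first so the helper only sees the two scores
cor = df.groupby('Interviewer')[['Score_1', 'Score_2']].apply(spearman_or_nan)
print(cor)
```

Under this reading, NaN for Interviewer_2 is the expected result rather than a failure. Note that the dense-rank shortcut above still returns a number for that group even though one column never varies, so its 0.998 should be treated with caution.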