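I have two data frames, one short and one long. I want to split the long one into chunks and compare each chunk with the short one using the correlation coefficient.
The splitting works fine, but when I compute the correlations, all but the first come back as NaN.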
import pandas as pd
data_a = {'ID': ["a1","a2","a3","a4","a5","a6","a7","a8","a9","a10","a11","a12","a13","a14","a15"],
'Unit_Weight': [178,153,193,195,214,157,205,212,219,166,217,186,170,207,204]}
df_a = pd.DataFrame(data_a)
data_b = {'ID': ["b1","b2","b3","b4","b5"],
'Unit_Weight': [128,123,123,125,204]}
df_b = pd.DataFrame(data_b)
size = 5 # 5 rows in the long data-frame
list_of_df_a = [df_a.loc[i:i+size-1,:] for i in range(0, len(df_a),size)]
for each in list_of_df_a:
    corr_e = each['Unit_Weight'].corr(df_b['Unit_Weight'])
    print(corr_e)
Output:
0.6797202605786716
nan
nan
What is going wrong, and how can I fix it? Thanks.
P.S. These are the results calculated by hand:
0.6797202605786716
-0.5501914564062937
0.2653370297540246
   ID  Unit_Weight
0  a1          178
1  a2          153
2  a3          193
3  a4          195
4  a5          214
    ID  Unit_Weight
5   a6          157
6   a7          205
7   a8          212
8   a9          219
9  a10          166
     ID  Unit_Weight
10  a11          217
11  a12          186
12  a13          170
13  a14          207
14  a15          204
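Answer 0 (score: 1)
Both Series must have the same index, because Series.corr aligns the two Series on their index labels before computing. The second and third chunks carry labels 5-9 and 10-14, which share nothing with the 0-4 labels of df_b, hence the NaN. Use reset_index with drop=True on each chunk: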
for each in list_of_df_a:
    corr_e = each['Unit_Weight'].reset_index(drop=True).corr(df_b['Unit_Weight'])
    print(corr_e)
0.6797202605786716
-0.5501914564062937
0.26533702975402457
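To make the alignment issue concrete, here is a minimal sketch using the second chunk's values from the question. The variable names (long_s, short_s, and so on) are made up for illustration; the calls to Series.align, Series.corr, and Series.reset_index are standard pandas.

```python
import pandas as pd

# Second chunk of the long frame: it keeps its original labels 5..9.
long_s = pd.Series([157, 205, 212, 219, 166], index=range(5, 10))
# The short frame's column has the default labels 0..4.
short_s = pd.Series([128, 123, 123, 125, 204])

# An inner alignment keeps only the labels present in both Series.
aligned_long, aligned_short = long_s.align(short_s, join="inner")
n_common = len(aligned_long)
print(n_common)  # no shared labels, so nothing is left to correlate

# With no overlapping labels, corr has no pairs to work with.
raw_corr = long_s.corr(short_s)
print(raw_corr)

# Resetting the index gives both Series labels 0..4, so rows pair up by position.
fixed_corr = long_s.reset_index(drop=True).corr(short_s)
print(fixed_corr)
```

The first chunk happened to work in the question only because its labels 0-4 already matched those of df_b.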
Answer 1 (score: 1)
@jezrael has a good answer, but another approach is to change:
list_of_df_a = [df_a.loc[i:i+size-1,:] for i in range(0, len(df_a),size)]
to:
list_of_df_a = [df_a.loc[i:i+size-1,:].reset_index(drop=True) for i in range(0, len(df_a),size)]
Now your results will be:
0.6797202605786716
-0.5501914564062937
0.26533702975402457
Answer 2 (score: 0)
You can also use numpy.corrcoef, which pairs values by position and so sidesteps the index problem:
import numpy as np

for each in list_of_df_a:
    corr_e = np.corrcoef(each['Unit_Weight'], df_b['Unit_Weight'])[0, 1]
    print(corr_e)
0.6797202605786716
-0.5501914564062937
0.2653370297540246
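For completeness, a self-contained sketch of the positional approach, assuming the data from the question. It slices with iloc and converts to NumPy arrays so that index labels never enter the picture. The length check is an added assumption, not part of the original answers: numpy.corrcoef needs both inputs to have the same length, so a shorter trailing chunk is skipped here.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"Unit_Weight": [178, 153, 193, 195, 214,
                                     157, 205, 212, 219, 166,
                                     217, 186, 170, 207, 204]})
df_b = pd.DataFrame({"Unit_Weight": [128, 123, 123, 125, 204]})

size = len(df_b)  # chunk length equals the length of the short frame
short = df_b["Unit_Weight"].to_numpy()

results = []
for start in range(0, len(df_a), size):
    # iloc slices by position; to_numpy() discards the index labels.
    chunk = df_a["Unit_Weight"].iloc[start:start + size].to_numpy()
    if len(chunk) != len(short):
        continue  # assumption: ignore an incomplete final chunk
    results.append(np.corrcoef(chunk, short)[0, 1])

print(results)
```

This should reproduce the three hand-calculated values from the question, up to floating-point rounding in the last digit.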