How to compute correlation coefficients between multiple variables stored in one column

Time: 2018-12-13 04:29:08

Tags: python pandas correlation

Category  SubCategory  Month  Value
A         A1           Jan     1
A         A1           Feb     2
A         A1           Mar     3
A         A2           Jan     2
A         A2           Feb     3
A         A2           Mar     5
B         B1           Jan     1
B         B1           Feb     6
B         B1           Mar     7
B         B2           Jan     3
B         B2           Feb     6
B         B2           Mar     7

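I have a sample pandas DataFrame (df) like the one above. I want to compute the correlation coefficient between subcategories A1 and A2, and between B1 and B2, but not across categories (for example, not between A1 and B1). My end goal is a table like this: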

    A1      A2      B1      B2
A1  1.0000  0.9820
A2  0.9820  1.0000
B1                  1.0000  0.9963
B2                  0.9963  1.0000

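Can someone help me with Python code for this?

Obviously, the following just gives me a corr value of 1 for each subcategory, because each group only has the Value column to correlate with itself: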

df.groupby('SubCategory').corr()
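A minimal sketch of why that call returns 1 everywhere, using the sample data from the question. Selecting `[["Value"]]` explicitly is an assumption added here so the example also runs on newer pandas versions, where non-numeric columns are not dropped automatically; the name `per_group` is illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, in long format.
df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "SubCategory": ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3,
    "Month": ["Jan", "Feb", "Mar"] * 4,
    "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7],
})

# Inside each SubCategory group there is only one numeric column,
# so every correlation matrix is 1x1: Value correlated with itself.
per_group = df.groupby("SubCategory")[["Value"]].corr()
print(per_group)
```

To correlate subcategories with each other, they first have to become separate columns, which is what the answers below do.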

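2 answers:

Answer 0: (score: 3)

This is first a pivot problem: reshape each category so that its subcategories become columns, then simply use corr on the result.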

pd.concat([x.pivot(index='Month', columns='SubCategory', values='Value').corr() for _,x in df.groupby('Category')])
                   A1        A2        B1        B2
SubCategory                                        
A1           1.000000  0.981981       NaN       NaN
A2           0.981981  1.000000       NaN       NaN
B1                NaN       NaN  1.000000  0.996271
B2                NaN       NaN  0.996271  1.000000
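The one-liner above can be unpacked into a self-contained sketch. The DataFrame is rebuilt from the table in the question, and the names `blocks`, `wide`, and `result` are illustrative rather than part of the original answer.

```python
import pandas as pd

# Sample data from the question, in long format.
df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "SubCategory": ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3,
    "Month": ["Jan", "Feb", "Mar"] * 4,
    "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7],
})

blocks = []
for _, group in df.groupby("Category"):
    # Rows become months, columns become this category's subcategories.
    wide = group.pivot(index="Month", columns="SubCategory", values="Value")
    # Pearson correlation between the subcategory columns.
    blocks.append(wide.corr())

# Stacking the per-category matrices leaves NaN for cross-category pairs.
result = pd.concat(blocks, sort=False)
print(result.round(6))
```

Aligning on Month through the pivot means the rows do not need to be in the same order within each subcategory.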

Answer 1: (score: 0)

Data

import pandas as pd
df = pd.DataFrame({"Category" :   ["A", "A", "A", "A", "A", "A", 
                                   "B", "B", "B", "B", "B", "B"], 
                   "SubCategory": ["A1", "A1", "A1", "A2", "A2", "A2", 
                                   "B1", "B1", "B1", "B2", "B2", "B2"],
                   "Value":       [1, 2, 3, 2, 3, 5, 
                                   1, 6, 7, 3, 6, 7]})

Solution

import numpy as np
# this will contain a list of DataFrames storing the correlation matrices
correlations = []
for _, group in df.groupby("Category"):
    sub_df = group[["SubCategory", "Value"]]
    # one row, one column per subcategory; each cell holds that subcategory's list of values
    data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list)
    labels = data.columns.tolist()
    # np.corrcoef treats each inner list as one variable
    correlation = pd.DataFrame(np.corrcoef(data.values.tolist()[0]),
                               columns=labels,
                               index=labels)
    correlations.append(correlation)
pd.concat(correlations, sort=False)

Output

    A1          A2          B1          B2
________________________________________________
A1  1.000000    0.981981    NaN         NaN
A2  0.981981    1.000000    NaN         NaN
B1  NaN         NaN         1.000000    0.996271
B2  NaN         NaN         0.996271    1.000000
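As a possible alternative without an explicit loop, one could compute the full correlation matrix once and then blank out the cross-category pairs. This is only a sketch under the assumption that a Month column is available, as in the question; the names `wide`, `parent`, `same_category`, and `masked` are illustrative.

```python
import pandas as pd

# Sample data from the question, including the Month column.
df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "SubCategory": ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3,
    "Month": ["Jan", "Feb", "Mar"] * 4,
    "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7],
})

# One wide table: rows are months, columns are all subcategories.
wide = df.pivot(index="Month", columns="SubCategory", values="Value")
full = wide.corr()

# Map each subcategory to its parent category.
parent = df.drop_duplicates("SubCategory").set_index("SubCategory")["Category"]

# Boolean matrix: True where the row and column subcategories share a category.
cats = parent.reindex(full.index).to_numpy()
same_category = cats[:, None] == cats[None, :]

# Keep within-category correlations, set the rest to NaN.
masked = full.where(same_category)
print(masked.round(6))
```

This computes cross-category correlations that are then discarded, so the per-category loop may be preferable when there are many categories.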

Update

This solution was tested with the Python and pandas versions shown below; older versions may not work correctly:

from platform import python_version
print('python version:', python_version())
import pandas as pd
print('pandas version:', pd.__version__)

    python version: 3.7.0
    pandas version: 0.23.4