Category SubCategory Month Value
A A1 Jan 1
A A1 Feb 2
A A1 Mar 3
A A2 Jan 2
A A2 Feb 3
A A2 Mar 5
B B1 Jan 1
B B1 Feb 6
B B1 Mar 7
B B2 Jan 3
B B2 Feb 6
B B2 Mar 7
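I have a sample pandas df like the one above. I want to compute the correlation coefficients between subcategories A1 and A2, and between B1 and B2, but not across categories (A1 and B1, and so on). My end goal is a table like this: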
        A1      A2      B1      B2
A1  1.0000  0.9820
A2  0.9820  1.0000
B1                  1.0000  0.9963
B2                  0.9963  1.0000
Can someone help me with the Python code?
Obviously, the following just gives me a corr value of 1 for each subcategory:
df.groupby('SubCategory').corr()
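For context, a minimal sketch of why that call returns 1 everywhere: within each SubCategory group the only numeric column is Value, so each per-group correlation matrix is 1x1 (Value against itself). The DataFrame below is rebuilt by hand from the sample data, and Value is selected explicitly as an assumption, since newer pandas versions do not drop the string columns automatically.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand
df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "SubCategory": ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3,
    "Month": ["Jan", "Feb", "Mar"] * 4,
    "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7],
})

# Each group has a single numeric column, so corr() can only compare
# Value with itself and every entry comes out as 1.
per_group = df.groupby("SubCategory")[["Value"]].corr()
print(per_group)
```

To correlate A1 with A2, the subcategories have to become separate columns first.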
Answer 0 (score: 3)
This is a pivot problem first; once each category is pivoted, just use corr:
pd.concat([x.pivot(index='Month', columns='SubCategory', values='Value').corr() for _, x in df.groupby('Category')])
                   A1        A2        B1        B2
SubCategory
A1           1.000000  0.981981       NaN       NaN
A2           0.981981  1.000000       NaN       NaN
B1                NaN       NaN  1.000000  0.996271
B2                NaN       NaN  0.996271  1.000000
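For anyone who wants to run this end to end, here is a self-contained sketch of the same approach. The DataFrame is rebuilt by hand from the question's sample data, and the names blocks and result are only illustrative. Keyword arguments are passed to pivot because newer pandas versions no longer accept them positionally.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand
df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "SubCategory": ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3,
    "Month": ["Jan", "Feb", "Mar"] * 4,
    "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7],
})

# One correlation matrix per category: months become rows,
# subcategories become columns, then corr() compares the columns.
blocks = [
    group.pivot(index="Month", columns="SubCategory", values="Value").corr()
    for _, group in df.groupby("Category")
]

# Stacking the blocks leaves NaN wherever two subcategories
# belong to different categories.
result = pd.concat(blocks)
print(result.round(6))
```

The NaN cells are a side effect of concat aligning columns, which is exactly the "no cross-category pairs" layout asked for.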
Answer 1 (score: 0)
Data
import pandas as pd
df = pd.DataFrame({"Category": ["A", "A", "A", "A", "A", "A",
                                "B", "B", "B", "B", "B", "B"],
                   "SubCategory": ["A1", "A1", "A1", "A2", "A2", "A2",
                                   "B1", "B1", "B1", "B2", "B2", "B2"],
                   "Value": [1, 2, 3, 2, 3, 5,
                             1, 6, 7, 3, 6, 7]})
Solution
import numpy as np
# np.corrcoef is called directly: SciPy's top-level alias sp.corrcoef is deprecated
# this will contain a list of DataFrames storing the correlation matrices
correlations = []
for _, group in df.groupby("Category"):
    sub_df = group[["SubCategory", "Value"]]
    # one row ("Value") and one column per subcategory; each cell holds that subcategory's list of values
    data = sub_df.pivot_table(columns="SubCategory", values="Value", aggfunc=list)
    names = data.columns.values.tolist()
    correlation = pd.DataFrame(np.corrcoef(data.values.tolist()[0]),
                               columns=names,
                               index=names)
    correlations.append(correlation)
pd.concat(correlations, sort=False)
Output
          A1        A2        B1        B2
________________________________________________
A1  1.000000  0.981981       NaN       NaN
A2  0.981981  1.000000       NaN       NaN
B1       NaN       NaN  1.000000  0.996271
B2       NaN       NaN  0.996271  1.000000
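A variant of the same loop, sketched here without relying on aggfunc=list: each subcategory's values are collected with a nested groupby and passed to np.corrcoef. This is an illustrative rewrite, not the answer's original code; the names by_sub, names and result are made up for the example. As in the data above there is no Month column, so observations are matched purely by position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Category": ["A"] * 6 + ["B"] * 6,
    "SubCategory": ["A1"] * 3 + ["A2"] * 3 + ["B1"] * 3 + ["B2"] * 3,
    "Value": [1, 2, 3, 2, 3, 5, 1, 6, 7, 3, 6, 7],
})

correlations = []
for _, group in df.groupby("Category"):
    # Map each subcategory to its values; rows are matched by position,
    # so every subcategory must list its observations in the same order.
    by_sub = {name: sub["Value"].to_numpy()
              for name, sub in group.groupby("SubCategory")}
    names = list(by_sub)
    # np.corrcoef treats each row of its input as one variable
    matrix = np.corrcoef([by_sub[name] for name in names])
    correlations.append(pd.DataFrame(matrix, index=names, columns=names))

result = pd.concat(correlations, sort=False)
print(result.round(6))
```

If the rows were not already in a consistent month order, sorting or pivoting on a month column first would be the safer route.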
Update
This solution was tested with the Python and pandas versions shown below; it may not work on older versions:
from platform import python_version
print('python version:', python_version())
import pandas as pd
print('pandas version:', pd.__version__)
python version: 3.7.0
pandas version: 0.23.4