我有一个数据帧df:
ID Var1 Var2
1 1.2 4
1 2.1 6
1 3.0 7
2 1.3 8
2 2.1 9
2 3.2 13
我想在每个Var1
找到Var2
和ID
之间的皮尔森相关系数值
所以结果应该是这样的:
ID Corr_Coef
1 0.98198
2 0.97073
更新
必须确保所有变量列均为int
或float
答案 0 :(得分:13)
要获得所需的输出格式,您可以使用.corrwith
:
corrs = (df[['Var1', 'ID']]
.groupby('ID')
.corrwith(df.Var2)
.rename(columns={'Var1' : 'Corr_Coef'}))
print(corrs)
Corr_Coef
ID
1 0.98198
2 0.97073
广义解决方案:
import numpy as np
def groupby_coef(df, col1, col2, on_index=True, squeeze=True, name='coef',
keys=None, **kwargs):
"""Grouped correlation coefficient between two columns
Flat result structure in contrast to `groupby.corr()`.
Parameters
==========
df : DataFrame
col1 & col2: str
Columns for which to calculate correlation coefs
on_index : bool, default True
Specify whether you're grouping on index
squeeze : bool, default True
True -> Series; False -> DataFrame
name : str, default 'coef'
Name of DataFrame column if squeeze == True
keys : column label or list of column labels / arrays
Passed to `pd.DataFrame.set_index`
**kwargs :
Passed to `pd.DataFrame.groupby`
"""
# If we are grouping on something other than the index, then
# set as index first to avoid hierarchical result.
# Kludgy, but safer than trying to infer.
if not on_index:
df = df.set_index(keys=keys)
if not kwargs:
# Assume we're grouping on 0th level of index
kwargs = {'level': 0}
grouped = df[[col1]].groupby(**kwargs)
res = grouped.corrwith(df[col2])
res.columns = [name]
if squeeze:
res = np.squeeze(res)
return res
示例:
df_1 = pd.DataFrame(np.random.randn(10, 2),
index=[1]*5 + [2]*5).add_prefix('var')
df_2 = df_1.reset_index().rename(columns={'index': 'var2'})
print(groupby_coef(df_1, 'var0', 'var1', level=0))
1 7.424e-18
2 -9.481e-19
Name: coef, dtype: float64
print(groupby_coef(df_2, col1='var0', col2='var1',
on_index=False, keys='var2'))
var2
1 7.424e-18
2 -9.481e-19
Name: coef, dtype: float64
答案 1 :(得分:8)
df.groupby('ID').corr()
输出:
Var1 Var2
ID
1 Var1 1.000000 0.981981
Var2 0.981981 1.000000
2 Var1 1.000000 0.970725
Var2 0.970725 1.000000
OP输出格式化。
df_out = df.groupby('ID').corr()
(df_out[~df_out['Var1'].eq(1)]
.reset_index(1, drop=True)['Var1']
.rename('Corr_Coef')
.reset_index())
输出:
ID Corr_Coef
0 1 0.981981
1 2 0.970725
答案 2 :(得分:0)
简单的解决方案:
df.groupby('ID').corr().unstack().iloc[:,1]