我在熊猫中有以下数据框
code tank var nozzle_1 nozzle_2 nozzle_3 nozzle_tank
123 1 23.34 12.23 54.56 12.22 11
123 1 22.32 11.32 7.89 3.45 12
123 1 21.22 19.93 5.54 5.66 12
123 1 21.34 12.23 54.56 22.22 14
123 1 32.32 13.32 4.89 32.45 34
123 1 32.22 29.93 23.54 23.66 33
123 2 23.34 12.23 54.56 12.22 21
123 2 22.32 11.32 7.89 3.45 22
123 2 21.22 19.93 5.54 5.66 21
123 2 21.34 12.23 54.56 22.22 21
123 2 32.32 13.32 4.89 32.45 22
123 2 32.22 29.93 23.54 23.66 21
我想计算出Tank_1上的喷嘴_1,喷嘴_2,喷嘴_3和喷嘴_4与var列的相关性
我想要的数据框是
code tank nozzle_1 nozzle_2 nozzle_3 nozzle_4
123 1 0.08 0.01 0.02 0.01
123 2 0.07 0.01 0.02 0.02
我正在熊猫后面追随
cols= df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
cols= np.array(cols)
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
final_df = pd.DataFrame()
for col in nozzles_to_scale:
corrs = (df[[col, tank]].groupby(tank).corrwith(df.var_col ).reset_index())
final_df = final_df.join(corrs)
return final_df
但是它似乎不起作用,我们如何在熊猫中做到这一点?
test = corrVar(df, cols, var, tank)
答案 0 :(得分:1)
您可以使用:
cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
var_col = 'var'
tank = 'tank'
def corrVar(df, cols, var_col, tank):
final_df = [df[[col, tank]].groupby(tank).corrwith(df[var_col]) for col in cols]
return pd.concat(final_df, axis=1)
print (corrVar(df, cols, var_col, tank))
nozzle_1 nozzle_2 nozzle_3
tank
1 0.501164 -0.309435 0.761017
2 0.501164 -0.309435 0.761017
编辑:每组每个N
值的相关性解决方案:
N = 3
g = df.groupby('tank').cumcount() // N
cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
var_col = 'var'
tank = 'tank'
code = 'code'
def corrVar(df, cols, var_col, tank, g):
#https://stackoverflow.com/a/48570300
final_df = [df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col]))
for col in cols]
return pd.concat(final_df, axis=1, keys=cols)
print (corrVar(df, cols, var_col, tank, g))
nozzle_1 nozzle_2 nozzle_3
tank
0 1 -0.826376 0.876202 0.703793
2 -0.826376 0.876202 0.703793
1 1 0.540176 -0.931286 0.614626
2 0.540176 -0.931286 0.614626
测试组:
print (df.assign(groups=g))
code tank var nozzle_1 nozzle_2 nozzle_3 nozzle_tank groups
0 123 1 23.34 12.23 54.56 12.22 11 0
1 123 1 22.32 11.32 7.89 3.45 12 0
2 123 1 21.22 19.93 5.54 5.66 12 0
3 123 1 21.34 12.23 54.56 22.22 14 1
4 123 1 32.32 13.32 4.89 32.45 34 1
5 123 1 32.22 29.93 23.54 23.66 33 1
6 123 2 23.34 12.23 54.56 12.22 21 0
7 123 2 22.32 11.32 7.89 3.45 22 0
8 123 2 21.22 19.93 5.54 5.66 21 0
9 123 2 21.34 12.23 54.56 22.22 21 1
10 123 2 32.32 13.32 4.89 32.45 22 1
11 123 2 32.22 29.93 23.54 23.66 21 1
编辑:
函数应为一行:
def corrVar(df, cols, var_col, tank, g):
return pd.concat([df.groupby([g, tank]).apply(lambda x: x[col].corr(x[var_col]))
for col in cols], axis=1, keys=cols)