I have the following dataframe in pandas
code tank var nozzle_1 nozzle_2 nozzle_3
123 1 23.34 12.23 54.56 12.22
123 1 22.32 11.32 7.89 3.45
123 1 21.22 19.93 5.54 5.66
123 1 21.34 12.23 54.56 22.22
123 1 32.32 13.32 4.89 32.45
123 1 32.22 29.93 23.54 23.66
123 2 23.34 12.23 54.56 12.22
123 2 22.32 11.32 7.89 3.45
123 2 21.22 19.93 5.54 5.66
123 2 21.34 12.23 54.56 22.22
123 2 32.32 13.32 4.89 32.45
123 2 32.22 29.93 23.54 23.66
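I want to find the correlation of nozzle_1, nozzle_2 and nozzle_3 with the var column, grouped by tank, and get one correlation for every 3 rows
My desired dataframe is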
code tank nozzle_1 nozzle_2 nozzle_3
123 1 0.20 0.30 0.23
123 1 0.12 0.08 0.12
123 2 0.14 0.12 0.01
123 2 0.15 0.04 0.13
I am doing the following in pandas
cols = df.columns[df.columns.str.contains(pat='nozzle_\d+$', regex=True)]
cols = np.array(cols)
def corrVar(df, cols):
    for col in cols_to_scale:
        for i in range(0, df.shape[0], 3):
            df[col] = df.groupby('tank')[col, 'var'].corr()
    return df
test = corrVar(df,cols)
However, it does not give me the desired result. How can we do this in pandas?
Answer 0 (score: 0)
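There is no easy solution for this, so here is my breakdown:
1. Get the column indices of the columns that start with nozzle
2. Get the column index of var
3. GroupBy on code, tank and, for each nozzle column, calculate the correlation over the top half of the dataframe (and likewise over the bottom half)
4. Concat the two halves on top of each other as the final dataframe
cols_idx = [df.columns.get_loc(c) for c in df.filter(like='nozzle').columns]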
var_idx = df.columns.get_loc('var')
df1 = pd.concat([
    df.groupby(['code','tank']).apply(lambda x: x.iloc[:len(x)//2, var_idx].corr(x.iloc[:len(x)//2, idx])) for idx in cols_idx
], axis=1).reset_index()
df2 = pd.concat([
    df.groupby(['code','tank']).apply(lambda x: x.iloc[len(x)//2:, var_idx].corr(x.iloc[len(x)//2:, idx])) for idx in cols_idx
], axis=1).reset_index()
df_final = pd.concat([df1,df2]).sort_values('tank').reset_index(drop=True)
Output
code tank 0 1 2
0 123 1 -0.826376 0.876202 0.703793
1 123 1 0.540176 -0.931286 0.614626
2 123 2 -0.826376 0.876202 0.703793
3 123 2 0.540176 -0.931286 0.614626
If you want to rename the columns properly
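A minimal sketch of one way to do the renaming, assuming df_final has the shape shown in the output above and that the columns 0, 1, 2 correspond to nozzle_1, nozzle_2, nozzle_3 in that order. The frame is rebuilt here from the printed values only so that the snippet runs on its own, and nozzle_cols is a name introduced for illustration.

```python
import pandas as pd

# Stand-in for df_final, rebuilt from the printed output above.
df_final = pd.DataFrame(
    [
        [123, 1, -0.826376, 0.876202, 0.703793],
        [123, 1, 0.540176, -0.931286, 0.614626],
        [123, 2, -0.826376, 0.876202, 0.703793],
        [123, 2, 0.540176, -0.931286, 0.614626],
    ],
    columns=['code', 'tank', 0, 1, 2],
)

# Assumption: these are the nozzle columns, in the same order as cols_idx.
nozzle_cols = ['nozzle_1', 'nozzle_2', 'nozzle_3']

# Map the positional labels 0, 1, 2 onto the nozzle names.
df_final = df_final.rename(columns=dict(enumerate(nozzle_cols)))
print(df_final)
```

With the real data, nozzle_cols could be taken from df.filter(like='nozzle').columns instead of being typed out.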
Answer 1 (score: -1)
import pandas as pd
data = [
    [123, 1, 23.34, 12.23, 54.56, 12.22],
    [123, 1, 22.32, 11.32, 7.89, 3.45],
    [123, 1, 21.22, 19.93, 5.54, 5.66],
    [123, 1, 21.34, 12.23, 54.56, 22.22],
    [123, 1, 32.32, 13.32, 4.89, 32.45],
    [123, 1, 32.22, 29.93, 23.54, 23.66],
    [123, 2, 23.34, 12.23, 54.56, 12.22],
    [123, 2, 22.32, 11.32, 7.89, 3.45],
    [123, 2, 21.22, 19.93, 5.54, 5.66],
    [123, 2, 21.34, 12.23, 54.56, 22.22],
    [123, 2, 32.32, 13.32, 4.89, 32.45],
    [123, 2, 32.22, 29.93, 23.54, 23.66]
]
columns = ['code', 'tank', 'var', 'nozzle_1', 'nozzle_2', 'nozzle_3']
df = pd.DataFrame(data=data, columns=columns)
print(df[['tank', 'var', 'nozzle_1', 'nozzle_2', 'nozzle_3']].groupby(['tank']).corr())
# ------------------------------------------------------
# RESULT:
# var nozzle_1 nozzle_2 nozzle_3
# tank
# 1 var 1.000000 0.501164 -0.309435 0.761017
# nozzle_1 0.501164 1.000000 -0.214982 0.168518
# nozzle_2 -0.309435 -0.214982 1.000000 0.107815
# nozzle_3 0.761017 0.168518 0.107815 1.000000
# 2 var 1.000000 0.501164 -0.309435 0.761017
# nozzle_1 0.501164 1.000000 -0.214982 0.168518
# nozzle_2 -0.309435 -0.214982 1.000000 0.107815
# nozzle_3 0.761017 0.168518 0.107815 1.000000
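Note that the matrix above is computed over all six rows of each tank, not over blocks of three rows as the question asks. A minimal sketch of a blockwise version follows, assuming rows are already in the intended order within each (code, tank) group; the names block, nozzle_cols and result are introduced here for illustration and are not from the original posts.

```python
import pandas as pd

# Same data as in the question: six rows per tank, identical for both tanks.
rows = [
    [23.34, 12.23, 54.56, 12.22],
    [22.32, 11.32, 7.89, 3.45],
    [21.22, 19.93, 5.54, 5.66],
    [21.34, 12.23, 54.56, 22.22],
    [32.32, 13.32, 4.89, 32.45],
    [32.22, 29.93, 23.54, 23.66],
]
columns = ['code', 'tank', 'var', 'nozzle_1', 'nozzle_2', 'nozzle_3']
df = pd.DataFrame(
    [[123, tank, *r] for tank in (1, 2) for r in rows],
    columns=columns,
)

nozzle_cols = [c for c in df.columns if c.startswith('nozzle_')]

# Number the rows inside each (code, tank) group and cut them into blocks of 3.
df = df.assign(block=df.groupby(['code', 'tank']).cumcount() // 3)

# For every block, correlate each nozzle column with var.
result = (
    df.groupby(['code', 'tank', 'block'])[['var'] + nozzle_cols]
      .apply(lambda g: g[nozzle_cols].corrwith(g['var']))
      .reset_index()
      .drop(columns='block')
)
print(result)
```

Because the block number comes from cumcount() // 3, this does not depend on each tank having exactly six rows; a trailing block with fewer than three rows would still be correlated over whatever rows it contains.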