Hello, I hope someone can help. I have a DataFrame df like this:
label  cell_name  hour  kpi1  kpi2
train  c1         1     10    20
train  c1         2     10    44
train  c1         3     11    33
train  c1         4     5     1
train  c1         5     2     6
test   c1         1     78    66
test   c1         2     45    2
test   c1         3     23    12
test   c1         4     65    45
test   c1         5     86    76
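What I want to do is conditionally subtract a constant, say 50, from the kpi1 and kpi2 columns of the test rows only, divide the result by the same columns of the train rows (matched on cell_name and hour), and append the result to the original DataFrame so that the new columns look like this: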
label  cell_name  hour  kpi1  kpi2  kpi1_index  kpi2_index
train  c1         1     10    20
train  c1         2     10    44
train  c1         3     11    33
train  c1         4     5     1
train  c1         5     2     6
test   c1         1     78    66    2.8         0.8
test   c1         2     45    2     -0.5        -1.09
test   c1         3     23    12    -2.45       -1.15
test   c1         4     65    45    3           -5
test   c1         5     86    76    18          4.33
I tried the following code:
import pandas as pd
import os
rr=os.getcwd()
df=pd.read_excel(rr+'\\KPI_test_train.xlsx')
print(df.columns)
def f(x,y):
    return ((x-50)/y)
df_grouped = df.groupby(['label'])
[dtest,dtrain]=[y for x,y in df_grouped]
dtest=dtest.groupby(['label','cell_name','hour']).sum()
dtrain=dtrain.groupby(['label','cell_name','hour']).sum()
for i in dtest.columns:
    dtest[i+'_index']=f(dtest[i],dtrain[i])
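The function f returns NaN for every row. This approach also feels a bit hacky, given that pandas usually handles this kind of thing neatly. What is the built-in way to do it?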
Answer 0: (score: 1)
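I think it is best here to process each DataFrame separately. First use DataFrame.pop to extract the label column for conditional filtering, then create a MultiIndex from the cell_name and hour columns so that the rows align, and apply the formula to all values. Then use DataFrame.add_suffix and DataFrame.join to add the new columns to the test DataFrame, and finally use concat if you need a single DataFrame: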
lab = df.pop('label')
# Split by label and index both parts on the alignment keys (cell_name, hour)
dtest = df[lab.eq('test')].set_index(['cell_name','hour'])
dtrain = df[lab.eq('train')].set_index(['cell_name','hour'])
# (test - 50) / train, joined onto the test rows as *_index columns
df = dtest.join(((dtest - 50) / dtrain).add_suffix('_index'))
df = (pd.concat([dtrain, df], keys=('train','test'), sort=False)
        .reset_index()
        .rename(columns={'level_0':'label'}))
print (df)
   label cell_name  hour  kpi1  kpi2  kpi1_index  kpi2_index
0  train        c1     1    10    20         NaN         NaN
1  train        c1     2    10    44         NaN         NaN
2  train        c1     3    11    33         NaN         NaN
3  train        c1     4     5     1         NaN         NaN
4  train        c1     5     2     6         NaN         NaN
5   test        c1     1    78    66    2.800000    0.800000
6   test        c1     2    45     2   -0.500000   -1.090909
7   test        c1     3    23    12   -2.454545   -1.151515
8   test        c1     4    65    45    3.000000   -5.000000
9   test        c1     5    86    76   18.000000    4.333333
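The NaN result in the question most likely comes from index alignment: pandas matches rows by index label when dividing, and with label kept in the index no test row shares an index entry with a train row. The sketch below rebuilds the sample data inline (an assumption, since the question loads it from an Excel file) and shows both cases. It then uses DataFrame.merge as an alternative to concat for attaching the new columns while keeping the original row order; the variable names are illustrative only.

```python
import pandas as pd

# Sample data copied from the question (built inline instead of read from Excel).
df = pd.DataFrame({
    "label": ["train"] * 5 + ["test"] * 5,
    "cell_name": ["c1"] * 10,
    "hour": [1, 2, 3, 4, 5] * 2,
    "kpi1": [10, 10, 11, 5, 2, 78, 45, 23, 65, 86],
    "kpi2": [20, 44, 33, 1, 6, 66, 2, 12, 45, 76],
})

keys = ["cell_name", "hour"]
kpis = ["kpi1", "kpi2"]
is_test = df["label"] == "test"

# Case 1: 'label' is part of the index, as in the question's groupby.
# The test and train indexes have no entries in common, so every cell is NaN.
test_l = df[is_test].set_index(["label"] + keys)[kpis]
train_l = df[~is_test].set_index(["label"] + keys)[kpis]
misaligned = (test_l - 50) / train_l

# Case 2: index only on the alignment keys, so rows pair up by cell and hour.
test = df[is_test].set_index(keys)[kpis]
train = df[~is_test].set_index(keys)[kpis]
index_cols = ((test - 50) / train).add_suffix("_index")

# Attach the new columns to the test rows only; a left merge keeps the
# original row order and leaves NaN in the train rows.
out = df.merge(
    index_cols.assign(label="test").reset_index(),
    on=["label"] + keys,
    how="left",
)
print(out)
```

Here the merge keys include label, so only the test rows pick up values in kpi1_index and kpi2_index.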