I am currently translating some R scripts into Python, but I am struggling with the following lines:
return(trackTable[, .(
  AVERAGE_WIND_COMPONENT = sum(TRACK_WIND_COMPONENT*GROUND_DIST, na.rm = T)/sum(GROUND_DIST, na.rm = T) # CHECK!!!!!
), by=KEY_COLUMN])
This is my attempt at rewriting the R code in Python:
table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
AVERAGE_WIND_COMPONENT = pd.DataFrame({'KEY_COLUMN':AVERAGE_WIND_COMPONENT.index, 'AVERAGE_WIND_COMPONENT':AVERAGE_WIND_COMPONENT.values})
However, the result I get for AVERAGE_WIND_COMPONENT is wrong. What did I translate incorrectly? I suspect the groupby, or the step where I build the temp column.
Example df:
KEY_COLUMN track_wind_component ground_dist
0 xyz -0.000000 2.262407
1 xyz 0.000000 9.769840
2 xyz -135.378229 38.581616
3 xyz 11.971863 30.996997
4 xyz -78.208083 45.404430
5 xyz -88.718762 48.589553
6 xyz -118.302506 22.193426
7 xyz -71.033648 76.602917
8 xyz -68.369886 11.092901
9 xyz -65.706124 6.210328
10 xyz -60.822561 17.444752
11 xyz 39.630277 18.082869
12 xyz 102.477706 35.175366
13 xyz 43.061773 8.793499
14 xyz -71.036785 15.289568
15 xyz 65.246215 49.247986
16 xyz -29.249612 1.043781
17 xyz -25.848495 11.490416
18 xyz -11.223688 NaN
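Expected result for this KEY_COLUMN: -36.8273304
Answer 0 (score: 0)
OK, now your expected result makes sense.
First, create a function that uses np.sum(). When called on a pandas Series, np.sum() skips NaN by default, which matches R's sum(value, na.rm = T):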
import numpy as np
import pandas as pd

def my_agg(df):
    # Distance-weighted mean of the wind component for one group
    names = {
        'result': np.sum(df['track_wind_component'] * df['ground_dist']) / np.sum(df['ground_dist'])
    }
    return pd.Series(names, index=['result'])

df.groupby('KEY_COLUMN').apply(my_agg)
Out:
result
KEY_COLUMN
xyz -36.827331
What was wrong with your code:
table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
# this is just creating a column that is the exact same as
# table['track_wind_component'] because, for example, (x*y)/y = x
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
# you are now essentially just grouping and summing the track_wind_component column
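To make the cancellation concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical, not taken from the question). It shows that the temp column equals the original column, so the grouped sum is a plain sum and not a distance-weighted mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the arithmetic is easy to follow
toy = pd.DataFrame({
    "KEY_COLUMN": ["a", "a", "a"],
    "track_wind_component": [10.0, 20.0, 30.0],
    "ground_dist": [1.0, 1.0, 8.0],
})

# The temp column from the question: ground_dist cancels out row by row
toy["temp"] = (toy["track_wind_component"] * toy["ground_dist"]) / toy["ground_dist"]
same_as_input = np.allclose(toy["temp"], toy["track_wind_component"])

# Grouping and summing temp is therefore just the plain sum: 10 + 20 + 30
plain_sum = toy.groupby("KEY_COLUMN")["temp"].sum()["a"]

# The distance-weighted mean is a different quantity: (10 + 20 + 240) / 10
weighted_mean = (
    (toy["track_wind_component"] * toy["ground_dist"]).sum()
    / toy["ground_dist"].sum()
)

print(same_as_input, plain_sum, weighted_mean)
```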
The R code divides the sum of (table['track_wind_component'] * table['ground_dist']) by the sum of (table['ground_dist']), all grouped by KEY_COLUMN.
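If you would rather avoid apply, the same calculation can probably be written with two grouped sums. This is only a sketch using the sample rows from the question; the helper column name `weighted` is my own choice, not something from the R script.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (the last ground_dist is NaN)
wind = [-0.000000, 0.000000, -135.378229, 11.971863, -78.208083,
        -88.718762, -118.302506, -71.033648, -68.369886, -65.706124,
        -60.822561, 39.630277, 102.477706, 43.061773, -71.036785,
        65.246215, -29.249612, -25.848495, -11.223688]
dist = [2.262407, 9.769840, 38.581616, 30.996997, 45.404430,
        48.589553, 22.193426, 76.602917, 11.092901, 6.210328,
        17.444752, 18.082869, 35.175366, 8.793499, 15.289568,
        49.247986, 1.043781, 11.490416, np.nan]

df = pd.DataFrame({
    "KEY_COLUMN": ["xyz"] * len(wind),
    "track_wind_component": wind,
    "ground_dist": dist,
})

# Row-wise product first ("weighted" is a hypothetical helper column)
df["weighted"] = df["track_wind_component"] * df["ground_dist"]

# Sum numerator and denominator per group; groupby sums skip NaN by default
sums = df.groupby("KEY_COLUMN")[["weighted", "ground_dist"]].sum()

# Divide the two grouped sums to get the distance-weighted mean per key
result = (
    (sums["weighted"] / sums["ground_dist"])
    .rename("AVERAGE_WIND_COMPONENT")
    .reset_index()
)
print(result)
```

On this sample the value for xyz should agree with the expected -36.8273304 from the question.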
The R code also ignores NaN values, which is why I used np.sum(): on a pandas Series it skips NaN by default.
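One caveat worth checking in your own code: the NaN skipping comes from pandas, not from NumPy itself. A minimal sketch of the difference, on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# On a Series, np.sum hands off to Series.sum, which skips NaN by default
series_total = np.sum(s)

# On a plain ndarray, np.sum propagates NaN
array_total = np.sum(s.to_numpy())

# np.nansum is the NumPy function that ignores NaN on arrays
nan_safe_total = np.nansum(s.to_numpy())

print(series_total, array_total, nan_safe_total)
```

So if the columns are ever converted to NumPy arrays before summing, np.nansum (or the Series .sum() method) is the safer match for R's na.rm = T.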