Pandas groupby and calculated sum

Time: 2018-10-27 18:36:10

Tags: python r pandas

I am currently translating some R scripts into Python, but I am struggling with the following lines:

  return(trackTable[, .(
    AVERAGE_WIND_COMPONENT = sum(TRACK_WIND_COMPONENT*GROUND_DIST, na.rm = T)/sum(GROUND_DIST, na.rm = T) # CHECK!!!!!
  ), by=KEY_COLUMN])

I tried to rewrite the R code in Python:

table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']
AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()
AVERAGE_WIND_COMPONENT = pd.DataFrame({'KEY_COLUMN':AVERAGE_WIND_COMPONENT.index, 'AVERAGE_WIND_COMPONENT':AVERAGE_WIND_COMPONENT.values})

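But my result for AVERAGE_WIND_COMPONENT is wrong. What did I mistranslate here? The problem is probably in the groupby, or in the way I build the temporary column.

Example df: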

    KEY_COLUMN  track_wind_component    ground_dist
0   xyz -0.000000   2.262407
1   xyz 0.000000    9.769840
2   xyz -135.378229 38.581616
3   xyz 11.971863   30.996997
4   xyz -78.208083  45.404430
5   xyz -88.718762  48.589553
6   xyz -118.302506 22.193426
7   xyz -71.033648  76.602917
8   xyz -68.369886  11.092901
9   xyz -65.706124  6.210328
10  xyz -60.822561  17.444752
11  xyz 39.630277   18.082869
12  xyz 102.477706  35.175366
13  xyz 43.061773   8.793499
14  xyz -71.036785  15.289568
15  xyz 65.246215   49.247986
16  xyz -29.249612  1.043781
17  xyz -25.848495  11.490416
18  xyz -11.223688  NaN

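Expected result for this KEY_COLUMN: -36.8273304

1 answer:

Answer 0: (score: 0)

OK, now your expected result makes sense.

First, create a function that uses np.sum(), which, when applied to a pandas Series, behaves like R's sum(value, na.rm = T):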

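import numpy as np
import pandas as pd

def my_agg(df):
    # Weighted average per group: sum(wind * dist) / sum(dist); NaN rows are skipped
    names = {
        'result': np.sum(df['track_wind_component'] * df['ground_dist']) / np.sum(df['ground_dist'])
    }

    return pd.Series(names, index=['result'])

df.groupby('KEY_COLUMN').apply(my_agg)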

Out:

            result
KEY_COLUMN  
xyz        -36.827331

What went wrong with your code

table['temp'] = (table['track_wind_component'] * table['ground_dist']) / table['ground_dist']

# this is just creating a column that is the exact same as
# table['track_wind_component'] because, for example, (x*y)/y = x

AVERAGE_WIND_COMPONENT = table.groupby(['KEY_COLUMN'])['temp'].sum()

# you are now essentially just grouping and summing the track_wind_component column
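
A quick way to convince yourself of this is a small check on toy data. The frame below uses made-up values (not the data from the question), and the names `temp`, `same` and `plain_sum` are only for illustration:

```python
import pandas as pd

# Toy data with hypothetical values, not the asker's data
table = pd.DataFrame({
    "KEY_COLUMN": ["a", "a", "b"],
    "track_wind_component": [10.0, -20.0, 5.0],
    "ground_dist": [1.0, 3.0, 2.0],
})

# (x * y) / y collapses back to x for every row with a non-zero, non-NaN y
temp = (table["track_wind_component"] * table["ground_dist"]) / table["ground_dist"]
same = bool(((temp - table["track_wind_component"]).abs() < 1e-12).all())
print(same)

# Grouping and summing temp is therefore a plain sum, not a distance-weighted average
plain_sum = temp.groupby(table["KEY_COLUMN"]).sum()
print(plain_sum)
```

The division happens row by row before the grouping, so the weights cancel out and never reach the aggregation step.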

What the R code does is divide the sum of (table['track_wind_component'] * table['ground_dist']) by the sum of (table['ground_dist']),

all grouped by KEY_COLUMN.
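
If you would rather avoid apply, the same "sum of products divided by sum of weights" can probably be written with two grouped sums. This is only a sketch: the helper `weighted_average` and the toy values are my own assumptions, and it relies on pandas' grouped sum skipping NaN by default:

```python
import numpy as np
import pandas as pd

def weighted_average(df, key="KEY_COLUMN", value="track_wind_component", weight="ground_dist"):
    """Hypothetical helper: per-group sum(value * weight) / sum(weight)."""
    product = df[value] * df[weight]                  # NaN wherever either factor is NaN
    numerator = product.groupby(df[key]).sum()        # grouped sum skips NaN by default
    denominator = df[weight].groupby(df[key]).sum()   # same NaN handling for the weights
    return (numerator / denominator).rename("AVERAGE_WIND_COMPONENT")

# Toy data with hypothetical values, including one missing distance
toy = pd.DataFrame({
    "KEY_COLUMN": ["a", "a", "a", "b"],
    "track_wind_component": [10.0, -20.0, 7.0, 5.0],
    "ground_dist": [1.0, 3.0, np.nan, 2.0],
})

result = weighted_average(toy)
print(result)
```

For group "a" this gives (10*1 + -20*3) / (1 + 3) = -12.5, because the row with the missing distance drops out of both sums, which mirrors na.rm = T in the R expression.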

The R code also ignores NaN values (na.rm = T), which is why I used np.sum() on the pandas Series.
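
One caveat worth knowing: np.sum() only skips NaN here because it is handed a pandas Series and ends up using the Series' own sum, which has skipna=True by default. On a plain NumPy array the NaN propagates. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan])

# On a Series, np.sum defers to Series.sum(), which skips NaN by default
series_total = np.sum(values)

# On a plain ndarray, NaN propagates through the sum
array_total = np.sum(values.to_numpy())

# np.nansum is the NumPy function that explicitly ignores NaN
nan_safe_total = np.nansum(values.to_numpy())

print(series_total, array_total, nan_safe_total)
```

So if you ever convert the columns with .values or .to_numpy() first, switch to np.nansum (or call .sum() on the Series directly) to keep the na.rm = T behaviour.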