Groupby aggregate method always returns NaN

Asked: 2016-07-23 07:14:12

Tags: python pandas

Hello, I have run into the following problem. My source data, events, looks like this:

   event_id             device_id            timestamp  longitude  latitude
0         1     29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2  -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
2         3  -4833982096941402721  2016-05-01 00:08:05     106.60     29.7

I am trying to group the events by device_id and then attach the per-device sum / mean / std of a variable to every event with that device_id:

events['latitude_mean'] = events.groupby(['device_id'])['latitude'].aggregate(np.sum)

But my output is always:

   event_id             device_id            timestamp  longitude  latitude
0         1     29182687948017175  2016-05-01 00:55:25     121.38     31.24   
1         2  -6401643145415154744  2016-05-01 00:54:12     103.65     30.97   
2         3  -4833982096941402721  2016-05-01 00:08:05     106.60     29.70   
3         4  -6815121365017318426  2016-05-01 00:06:40     104.27     23.28   
4         5  -5373797595892518570  2016-05-01 00:07:18     115.88     28.66   

   latitude_mean  
0            NaN  
1            NaN  
2            NaN  
3            NaN  
4            NaN

What am I doing wrong that makes every row come back as NaN?

1 answer:

Answer 0: (score: 4)

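You can use the pandas.core.groupby.GroupBy.transform(aggfunc) method, which applies aggfunc to each group and returns a result with the same index as the original data, so every row receives the value computed for its group: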

In [32]: events['latitude_mean'] = events.groupby(['device_id'])['latitude'].transform('sum')

In [33]: events
Out[33]:
   event_id            device_id            timestamp  longitude  latitude  latitude_mean
0         1    29182687948017175  2016-05-01 00:55:25     121.38     31.24          62.55
1         2    29182687948017175  2016-05-30 12:12:12     777.77     31.31          62.55
2         3 -6401643145415154744  2016-05-01 00:54:12     103.65     30.97          64.30
3         4 -6401643145415154744  2016-01-01 11:11:11     111.11     33.33          64.30

Here you may find some usage examples
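As a minimal sketch of the transform approach, the snippet below rebuilds a small frame from the sample data in this answer (only the columns needed here) and adds all three statistics the question asks about. The column names latitude_sum, latitude_mean and latitude_std are illustrative choices, not from the original post; note that the column called latitude_mean above actually holds a sum.

```python
import pandas as pd

# Small frame rebuilt from the sample data in this answer.
events = pd.DataFrame({
    "event_id": [1, 2, 3, 4],
    "device_id": [29182687948017175, 29182687948017175,
                  -6401643145415154744, -6401643145415154744],
    "latitude": [31.24, 31.31, 30.97, 33.33],
})

grouped = events.groupby("device_id")["latitude"]

# transform returns a Series indexed like `events`,
# so each assignment lines up row by row.
events["latitude_sum"] = grouped.transform("sum")
events["latitude_mean"] = grouped.transform("mean")
events["latitude_std"] = grouped.transform("std")

print(events)
```

Reusing the same grouped object avoids regrouping the frame once per statistic.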

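Explanation: when you group a DataFrame and aggregate, the result is usually a Series with fewer rows and a different index (here, the device_id values). When you assign that Series to a new column, pandas aligns it on index labels, and none of those labels match the row index of events, so you get NaNs: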

In [31]: events.groupby(['device_id'])['latitude'].agg(np.sum)
Out[31]:
device_id
-6401643145415154744    64.30
 29182687948017175      62.55
Name: latitude, dtype: float64

So when you try to assign it to a new column, pandas effectively does the following:

In [36]: events['nans'] = pd.Series([1,2], index=['a','b'])

In [38]: events[['event_id','nans']]
Out[38]:
   event_id  nans
0         1   NaN
1         2   NaN
2         3   NaN
3         4   NaN
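The alignment behaviour can be checked in isolation. The following sketch uses a hypothetical four-row frame df and two made-up Series: one whose labels never match the row index, and one whose labels match only some rows.

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..3.
df = pd.DataFrame({"event_id": [1, 2, 3, 4]})

# Labels 'a' and 'b' do not exist in df.index, so no row gets a value.
df["no_match"] = pd.Series([1, 2], index=["a", "b"])

# Labels 0 and 2 do exist in df.index, so only those rows get a value.
df["partial_match"] = pd.Series([10, 20], index=[0, 2])

print(df)
```

The aggregated Series in the question behaves like no_match: its labels are device_id values, which are not row labels of events.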

Data:

In [30]: events
Out[30]:
   event_id            device_id            timestamp  longitude  latitude
0         1    29182687948017175  2016-05-01 00:55:25     121.38     31.24
1         2    29182687948017175  2016-05-30 12:12:12     777.77     31.31
2         3 -6401643145415154744  2016-05-01 00:54:12     103.65     30.97
3         4 -6401643145415154744  2016-01-01 11:11:11     111.11     33.33
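If the aggregated table is wanted anyway, one alternative (not part of the original answer) is to aggregate once and join the result back on device_id. This is a sketch under that assumption; the names stats and events_with_stats and the latitude_ prefix are illustrative.

```python
import pandas as pd

# Small frame rebuilt from the sample data above.
events = pd.DataFrame({
    "event_id": [1, 2, 3, 4],
    "device_id": [29182687948017175, 29182687948017175,
                  -6401643145415154744, -6401643145415154744],
    "latitude": [31.24, 31.31, 30.97, 33.33],
})

# One row per device_id, indexed by device_id.
stats = (
    events.groupby("device_id")["latitude"]
    .agg(["sum", "mean", "std"])
    .add_prefix("latitude_")
)

# on="device_id" matches the device_id column of `events`
# against the index of `stats`, keeping the rows of `events`.
events_with_stats = events.join(stats, on="device_id")

print(events_with_stats)
```

Compared with transform, this keeps the per-device table stats available for reuse, at the cost of an explicit join.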