Pandas 0.18.1 groupby and resample: error with multi-level aggregation

Date: 2016-08-09 22:11:29

Tags: python-2.7 pandas

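I just updated pandas from 0.17.1 to 0.18.1, and while changing some pre-existing code I think I found an issue with the new resample method, outlined below. According to the documentation linked below, df3_resample and df4_resample in my example should return the same DataFrame, but df4_resample raises an exception. This tripped me up for a while, so I thought I would share.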

Exception: Column(s) A already selected

http://pandas.pydata.org/pandas-docs/version/0.18.0/whatsnew.html#whatsnew-0180-breaking-resample

http://pandas.pydata.org/pandas-docs/version/0.18.1/whatsnew.html#groupby-syntax-with-window-and-resample-operations

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 4),
                  columns=list('ABCD'),
                  index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
df['item'] = 'item_a'  # add column for groupby

# THIS WORKS: groupby + resample with a flat dict
df1_resample = df.groupby('item').resample('2s').agg({'A': np.mean, 'B': np.max}).reset_index()
print(df1_resample)

# THIS WORKS: resample alone with a nested dict
df2_resample = df.resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}}).reset_index()
print(df2_resample)

# THIS WORKS: groupby + apply(resample) with a nested dict
df3_resample = df.groupby('item').apply(lambda x: x.resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}})).reset_index()
print(df3_resample)

# THIS DOESN'T WORK: groupby + resample with a nested dict raises the exception
df4_resample = df.groupby('item').resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}})
print(df4_resample)

Output:

     item             level_1         A         B
0  item_a 2010-01-01 09:00:00  0.611660  0.739640 
1  item_a 2010-01-01 09:00:02  0.615876  0.880113
2  item_a 2010-01-01 09:00:04  0.218292  0.441504
3  item_a 2010-01-01 09:00:06  0.753698  0.637787
4  item_a 2010-01-01 09:00:08  0.471272  0.474738
                  index         A          
                         A_mean     A_max
0 2010-01-01 09:00:00  0.611660  0.813038
1 2010-01-01 09:00:02  0.615876  0.994657
2 2010-01-01 09:00:04  0.218292  0.233478
3 2010-01-01 09:00:06  0.753698  0.848107
4 2010-01-01 09:00:08  0.471272  0.610592
     item             level_1         A          
                                 A_mean     A_max
0  item_a 2010-01-01 09:00:00  0.611660  0.813038
1  item_a 2010-01-01 09:00:02  0.615876  0.994657
2  item_a 2010-01-01 09:00:04  0.218292  0.233478
3  item_a 2010-01-01 09:00:06  0.753698  0.848107
4  item_a 2010-01-01 09:00:08  0.471272  0.610592


  File "<some_file.py>", line 29, in <module>
    df4_resample = df.groupby('item').resample('2s').agg({'A': {'A_mean': np.mean, 'A_max': np.max}})

  File "C:\Anaconda2\lib\site-packages\pandas\tseries\resample.py", line 293, in aggregate
    result, how = self._aggregate(arg, *args, **kwargs)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 505, in _aggregate
    result = list(_agg(arg, _agg_1dim).values())

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 496, in _agg
    result[fname] = func(fname, agg_how)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 479, in _agg_1dim
    return colg.aggregate(how, _level=(_level or 0) + 1)

  File "C:\Anaconda2\lib\site-packages\pandas\tseries\resample.py", line 293, in aggregate
    result, how = self._aggregate(arg, *args, **kwargs)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 528, in _aggregate
    result = _agg(arg, lambda fname,

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 496, in _agg
    result[fname] = func(fname, agg_how)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 529, in <lambda>
    agg_how: _agg_1dim(self._selection, agg_how))

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 475, in _agg_1dim
    colg = self._gotitem(name, ndim=1, subset=subset)

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 680, in _gotitem
    groupby=self._groupby[key],

  File "C:\Anaconda2\lib\site-packages\pandas\core\base.py", line 326, in __getitem__
    raise Exception('Column(s) %s already selected' % self._selection)

Exception: Column(s) A already selected

1 Answer:

Answer 0 (score: 0)

I'm not sure why resample doesn't work here, but there is a handy workaround that doesn't require a lambda. Try this:

df.groupby([
    'item', pd.Grouper(freq = '2s')
]).agg({
    'A' : ['mean', 'max']
}).rename(columns = {
    'mean' : 'A_mean', 'max' : 'A_max'
}, level = 1).reset_index()


Instead of calling .resample('2s'), you can add pd.Grouper(freq='2s') to the groupby() keys. In your case it does the same thing. Here is the documentation: http://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.Grouper.html
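To make the equivalence concrete, here is a minimal sketch that compares the pd.Grouper approach with a plain resample on the same data. The fixed seed and the variable names (via_grouper, via_resample, result) are my own additions for illustration, and the comparison against an ungrouped resample is only valid because the example frame contains a single item.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question, with a fixed seed (an assumption,
# added only so the run is reproducible).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10, 4),
                  columns=list('ABCD'),
                  index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
df['item'] = 'item_a'

# Group by item and by 2-second bins of the index in a single groupby call.
via_grouper = df.groupby(['item', pd.Grouper(freq='2s')])['A'].agg(['mean', 'max'])

# Plain resample of column A; comparable here because there is only one item.
via_resample = df['A'].resample('2s').agg(['mean', 'max'])

# Both approaches produce the same per-bin statistics.
same_values = np.allclose(via_grouper.values, via_resample.values)

# Rename the aggregate columns explicitly instead of using a nested dict.
result = via_grouper.rename(columns={'mean': 'A_mean', 'max': 'A_max'}).reset_index()
print(result)
```

With more than one item, the Grouper version still yields one row per (item, bin) pair, which is what the groupby + resample call in the question was meant to produce.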

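On another note, you should avoid renaming columns through nested dictionaries (that usage is deprecated) and use the actual .rename() function instead.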
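If you are on a much newer pandas than the 0.18.1 in the question, named aggregation (available from pandas 0.25 onward) gives the renamed columns directly, without a nested dict or a separate .rename() call. This is only a sketch under that version assumption; the seed and the variable name named are mine.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame with a fixed seed (an assumption for reproducibility).
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(10, 4),
                  columns=list('ABCD'),
                  index=pd.date_range('2010-01-01 09:00:00', periods=10, freq='s'))
df['item'] = 'item_a'

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, aggregation function) pair.
named = (
    df.groupby(['item', pd.Grouper(freq='2s')])
      .agg(A_mean=('A', 'mean'), A_max=('A', 'max'))
      .reset_index()
)
print(named)
```

The output columns come out flat (A_mean, A_max) rather than as a two-level column index, so no level-aware rename is needed afterwards.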