Question

This is my first question on StackOverflow. So far I have always been able to find answers by searching, so I hope I'm not embarrassing myself by asking a duplicate.

I'm resampling a pandas DataFrame. I then want to loop over the DataFrames in the resampler object to extract some information.

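However, when I use the keys returned by resampler.groups.keys(), I get a KeyError for any week that has no data. This seems inconsistent to me. I would have expected either to get an empty DataFrame back, or for keys() not to return a key for that week's group at all.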

import pandas as pd

df = pd.read_csv('debug.csv', index_col = 'DATETIME', parse_dates=True)

by_week = df.resample('W-SUN')
by_week.groups

Gives:

{Timestamp('2017-02-26 00:00:00', offset='W-SUN'): 1,
 Timestamp('2017-03-05 00:00:00', offset='W-SUN'): 1,
 Timestamp('2017-03-12 00:00:00', offset='W-SUN'): 1,
 Timestamp('2017-03-19 00:00:00', offset='W-SUN'): 8}

A sum then shows that the two middle weeks have no data:

print by_week.sum()

                   ID    DATA
DATETIME                     
2017-02-26  1020754.0    74.0
2017-03-05        NaN     NaN
2017-03-12        NaN     NaN
2017-03-19  7151408.0  2526.0
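A quick way to see which weekly bins are actually empty is to count the rows per bin with size(). The sketch below is only an illustration: the small DataFrame is a made-up stand-in for debug.csv (which is not shown here), and the names rows_per_week and non_empty_weeks are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for debug.csv: one row in late February, two in mid-March.
index = pd.to_datetime([
    "2017-02-21 13:42:01",
    "2017-03-17 09:00:00",
    "2017-03-18 22:41:10",
])
df = pd.DataFrame(
    {"ID": [1020754, 1021625, 1021626], "DATA": [74, 428, 384]},
    index=index,
)
df.index.name = "DATETIME"

# Count the rows that fall into each week ending on Sunday.
# Weeks with no rows still get a label, with a count of 0.
rows_per_week = df.resample("W-SUN").size()

# Keep only the labels of weeks that contain at least one row.
non_empty_weeks = rows_per_week[rows_per_week > 0].index

print(rows_per_week)
print(list(non_empty_weeks))
```

The labels in non_empty_weeks are the weeks that hold data; the zero-count weeks are the ones that show up as keys even though no rows belong to them.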

Showing the keys of the resampler groups:

for key in sorted(by_week.groups.keys(), reverse=True):
    print key

2017-03-19 00:00:00
2017-03-12 00:00:00
2017-03-05 00:00:00
2017-02-26 00:00:00

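Now I try to do something with each group's DataFrame. The first week works fine, but the second one blows up. Why does keys() return keys that are not valid?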

for key in sorted(by_week.groups.keys(), reverse=True):
    df = by_week.get_group(key)
    print df.head()

                              ID  DATA
DATETIME                              
2017-03-18 22:41:10.859  1021626   384
2017-03-18 23:45:18.773  1021627   375
2017-03-18 23:45:35.309  1021628   359
2017-03-18 23:46:45.303  1021629   188
2017-03-19 01:02:23.554  1021633   373


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-a57723281f49> in <module>()
      1 for key in sorted(by_week.groups.keys(), reverse=True):
----> 2     df = by_week.get_group(key)
      3     print df.head()

//anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in get_group(self, name, obj)
    585         inds = self._get_index(name)
    586         if not len(inds):
--> 587             raise KeyError(name)
    588 
    589         return obj.take(inds, axis=self.axis, convert=False)

KeyError: Timestamp('2017-03-12 00:00:00', offset='W-SUN')

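My workaround is below; the two middle weeks with no data are simply skipped. Any feedback is welcome if there is a more appropriate way to handle this. Is there a fundamentally better way to iterate over each week's data?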

for key in sorted(by_week.groups.keys(), reverse=True):
    try:
        df = by_week.get_group(key)
    except KeyError:
        # No rows fell into this weekly bin, so skip it
        continue
    print df.head()

                              ID  DATA
DATETIME                              
2017-03-18 22:41:10.859  1021626   384
2017-03-18 23:45:18.773  1021627   375
2017-03-18 23:45:35.309  1021628   359
2017-03-18 23:46:45.303  1021629   188
2017-03-19 01:02:23.554  1021633   373
                              ID  DATA
DATETIME                              
2017-02-21 13:42:01.133  1020754    74

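EDIT/UPDATE: To address the response below about using the built-in iterator: my original code did use the built-in iterator, but this is what I got.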

import pandas as pd
df = pd.read_csv('debug.csv', index_col = 'DATETIME', parse_dates=True)
by_week = df.resample('W-SUN')

for key, df in by_week:
    print df.head()

Gives:

Traceback (most recent call last):
  File "debug_sampler.py", line 10, in <module>
    for key, df in by_week:
  File "<redacted path>/pandas/core/groupby.py", line 600, in __iter__
    return self.grouper.get_iterator(self.obj, axis=self.axis)
AttributeError: 'NoneType' object has no attribute 'get_iterator'

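Interestingly, if I use groupby instead, it works fine. But I'd hate to give up the convenience of the resample method (e.g., resampling by weeks that end on an arbitrary day).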

import pandas as pd
df = pd.read_csv('debug.csv', index_col = 'DATETIME', parse_dates=True)

by_week_groupby = df.groupby(lambda x: x.week)

for key, df in by_week_groupby:
    print df.head()

Gives:

                              ID  DATA
DATETIME                              
2017-02-21 13:42:01.133  1020754    74
                              ID  DATA
DATETIME                              
2017-03-19 17:01:01.352  1021625   428
2017-03-18 22:41:10.859  1021626   384
2017-03-18 23:45:18.773  1021627   375
2017-03-18 23:45:35.309  1021628   359
2017-03-18 23:46:45.303  1021629   188
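For what it's worth, one possible way to keep the "week ending on a given day" convention while still going through groupby is to pass a pd.Grouper with a frequency such as 'W-SUN'. The following is a minimal sketch under assumptions: the DataFrame is synthetic (a stand-in for debug.csv), and iter_non_empty_weeks is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for debug.csv.
index = pd.to_datetime([
    "2017-02-21 13:42:01",
    "2017-03-17 09:00:00",
    "2017-03-18 22:41:10",
])
df = pd.DataFrame(
    {"ID": [1020754, 1021625, 1021626], "DATA": [74, 428, 384]},
    index=index,
)
df.index.name = "DATETIME"


def iter_non_empty_weeks(frame, freq="W-SUN"):
    """Yield (week_end, sub_frame) pairs, skipping weeks with no rows.

    `freq` is any pandas offset alias, e.g. 'W-SUN' or 'W-WED'.
    """
    for week_end, week_df in frame.groupby(pd.Grouper(freq=freq)):
        if not week_df.empty:
            yield week_end, week_df


weeks = list(iter_non_empty_weeks(df))
for week_end, week_df in weeks:
    print(week_end.date())
    print(week_df.head())
```

Changing freq to another anchored alias (for example 'W-WED') should move the week boundary without touching the loop.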

Installed pandas version:

print pd.__version__
0.18.1

Answer 1

Don't force yourself to iterate over the groupby object by key when pandas already has a built-in (though not very obvious) iterator:

for key, df in by_week:
    print(df.head())
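A self-contained version of this loop might look like the sketch below. It assumes a reasonably recent pandas in which iterating over the result of resample() works (the question reports an AttributeError on 0.18.1), and it uses synthetic data in place of debug.csv. The weekly_frames dict is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for debug.csv.
index = pd.to_datetime([
    "2017-02-21 13:42:01",
    "2017-03-17 09:00:00",
    "2017-03-18 22:41:10",
])
df = pd.DataFrame(
    {"ID": [1020754, 1021625, 1021626], "DATA": [74, 428, 384]},
    index=index,
)
df.index.name = "DATETIME"

by_week = df.resample("W-SUN")

# Collect each week's rows, ignoring weeks where the sub-frame is empty.
weekly_frames = {}
for week_end, week_df in by_week:
    if week_df.empty:
        continue
    weekly_frames[week_end] = week_df

# Walk the weeks from newest to oldest, as in the question.
for week_end in sorted(weekly_frames, reverse=True):
    print(week_end.date())
    print(weekly_frames[week_end].head())
```

Checking week_df.empty inside the loop avoids calling get_group() at all, so there is no KeyError to catch.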

Pandas resampler KeyError
