Question

I have a pandas DataFrame whose column holds tz-aware Timestamps, and I tried groupby(level=0).first() on it. The result is incorrect. Am I missing something, or is this a pandas bug?

import pandas as pd

x = pd.DataFrame(index=[1, 1, 2, 2, 2], data=pd.date_range("7:00", "9:00", freq="30min", tz='US/Eastern'))

In [58]: x
Out[58]: 
                          0
1 2016-09-08 07:00:00-04:00
1 2016-09-08 07:30:00-04:00
2 2016-09-08 08:00:00-04:00
2 2016-09-08 08:30:00-04:00
2 2016-09-08 09:00:00-04:00

In [59]: x.groupby(level=0).first()
Out[59]: 
                          0
1 2016-09-08 11:00:00-04:00
2 2016-09-08 12:00:00-04:00

Answer 1

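I don't believe this is a bug. If you look through the docs, they state clearly that for the US/Eastern time zone there is no way to specify whether a time falls before or after the end-of-daylight-saving-time transition.

In such cases, sticking with UTC seems to be the best option.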

Excerpt from the docs:

 Be aware that timezones (e.g., pytz.timezone('US/Eastern')) are not
 necessarily equal across timezone versions. So if data is localized to
 a specific timezone in the HDFStore using one version of a timezone
 library and that data is updated with another version, the data will
 be converted to UTC since these timezones are not considered equal.
 Either use the same version of timezone library or use tz_convert with
 the updated timezone definition.

The conversion can be done as follows:

A: Use the tz_localize method to localize naive datetimes to UTC

data = pd.date_range("7:00", "9:00", freq="30min").tz_localize('UTC')

B: Use the tz_convert method to convert the tz-aware pandas object to another time zone.

df = pd.DataFrame(index=[1,1,2,2,2], data=data.tz_convert('US/Eastern'))
df.groupby(level=0).first()

Resulting in:

                          0
1 2016-09-09 07:00:00-04:00
2 2016-09-09 08:00:00-04:00

#0    datetime64[ns, US/Eastern]
#dtype: object
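
Note that localizing naive times to UTC and then converting them shifts the wall-clock values by the UTC offset. If the Eastern wall-clock times from the question need to be preserved, one possible variation on the "stick with UTC" idea is to start from the tz-aware Eastern values, aggregate in UTC, and convert back afterwards. This is only a sketch: the explicit 2016-09-08 date and the names first_utc and first_eastern are assumptions made for illustration, and it has not been checked against the pandas version used in the question.

```python
import pandas as pd

# Same data as in the question, with an explicit date so the result is reproducible.
idx = pd.date_range("2016-09-08 07:00", "2016-09-08 09:00", freq="30min", tz="US/Eastern")
x = pd.DataFrame({0: idx}, index=[1, 1, 2, 2, 2])

# Aggregate on UTC values, so no local-time interpretation is involved in the groupby.
first_utc = x[0].dt.tz_convert("UTC").groupby(level=0).first()

# Convert back to US/Eastern only for presentation.
first_eastern = first_utc.dt.tz_convert("US/Eastern")
print(first_eastern)
```

Because tz_convert changes only the time zone attached to each instant and not the instant itself, the round trip through UTC should leave the first value of each group at its original Eastern time.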

Answer 2

This is actually a pandas bug, reported here:

https://github.com/pydata/pandas/issues/10668
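
The wrong values in the question are exactly four hours late, which matches the UTC offset of US/Eastern on that date. That is consistent with the UTC wall-clock time being relabelled as US/Eastern instead of being converted. The following standard-library sketch only illustrates that arithmetic; it is not taken from the pandas source, and the fixed -04:00 offset is an assumed stand-in for US/Eastern on 2016-09-08.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed -04:00 offset stands in for US/Eastern (EDT) on 2016-09-08.
EDT = timezone(timedelta(hours=-4))

# The value groupby(level=0).first() should return for key 1.
first_in_group = datetime(2016, 9, 8, 7, 0, tzinfo=EDT)

# The same instant expressed as a naive UTC wall-clock time.
utc_wall_clock = first_in_group.astimezone(timezone.utc).replace(tzinfo=None)

# Attaching the Eastern offset without converting reproduces the value in Out[59].
relabelled = utc_wall_clock.replace(tzinfo=EDT)

print(first_in_group.isoformat(sep=" "))  # 2016-09-08 07:00:00-04:00
print(relabelled.isoformat(sep=" "))      # 2016-09-08 11:00:00-04:00
```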
