How to resample a pandas DataFrame containing datetimes without losing tzinfo

Asked: 2015-08-11 12:03:31

Tags: python pandas

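I have a DataFrame that I want to resample. One of its columns contains timezone-aware datetimes, and I want the resampled DataFrame to hold the maximum of those datetimes within each sample period. I expect the values in the resampled DataFrame to still be timezone-aware.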

For a Series, this works:

import pandas as pd
import pytz
from datetime import datetime, timedelta
from random import randint
rng = pd.date_range('20150101',periods=30,freq='D')

ts=pd.Series([datetime(2014,1,1, tzinfo=pytz.UTC) + timedelta(days=randint(1,365)) for item in rng], index=rng)

>>> ts.resample('W', how='max')
2015-01-04    2014-10-09 00:00:00+00:00
2015-01-11    2014-12-14 00:00:00+00:00
2015-01-18    2014-12-18 00:00:00+00:00
2015-01-25    2014-10-18 00:00:00+00:00
2015-02-01    2014-11-21 00:00:00+00:00

However, if I do the same thing with a DataFrame, the timezone information is dropped:

df=pd.DataFrame([(datetime(2014,1,1, tzinfo=pytz.UTC)+timedelta(days=randint(1,365))) for item in rng], index=rng)
df.columns=['last_seen']

>>> df.head()
                            last_seen
2015-01-01  2014-11-15 00:00:00+00:00
2015-01-02  2014-11-13 00:00:00+00:00
2015-01-03  2014-03-06 00:00:00+00:00
2015-01-04  2014-11-29 00:00:00+00:00
2015-01-05  2014-11-12 00:00:00+00:00

>>> df.resample('W', how='max')
            last_seen
2015-01-04 2014-11-29
2015-01-11 2014-12-29
2015-01-18 2014-12-13
2015-01-25 2014-11-03
2015-02-01 2014-12-21

>>> df.resample('W', how={'last_seen': 'max'})
            last_seen
2015-01-04 2014-11-29
2015-01-11 2014-12-29
2015-01-18 2014-12-13
2015-01-25 2014-11-03
2015-02-01 2014-12-21
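One possible workaround for the single-column case, offered as a sketch rather than a confirmed fix: convert the column to naive UTC values, resample, and re-attach the timezone afterwards. It assumes a pandas version where `resample` returns an object you call `.max()` on (instead of the `how=` keyword used above), it swaps the random data for deterministic values, and it uses `datetime.timezone.utc` in place of `pytz.UTC`.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Deterministic stand-in for the random data in the question (an assumption,
# chosen so that the result is reproducible).
rng = pd.date_range('20150101', periods=30, freq='D')
values = [
    datetime(2014, 1, 1, tzinfo=timezone.utc) + timedelta(days=(7 * i) % 365 + 1)
    for i in range(len(rng))
]
df = pd.DataFrame({'last_seen': values}, index=rng)

# Normalise to a tz-aware UTC datetime column, whatever dtype pandas inferred.
last_seen_utc = pd.to_datetime(df['last_seen'], utc=True)

# Drop the tz (the values remain UTC wall-clock times), resample,
# then re-attach UTC to the aggregated result.
naive = last_seen_utc.dt.tz_localize(None)
weekly_max = naive.resample('W').max().dt.tz_localize('UTC')

print(weekly_max)
```

The round trip through naive values is only safe because everything is in UTC; with a zone that observes daylight saving time, `tz_convert('UTC')` before stripping the tz would be the safer order.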

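This is actually a simplified example of what I am trying to do, which is to group by one column while resampling two other columns with different functions: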

df=pd.DataFrame([(randint(1,4), randint(1,100), datetime(2014,1,1, tzinfo=pytz.UTC)+timedelta(days=randint(1,365))) for item in rng], index=rng)
df.columns=('id', 'val', 'last_seen')
df.groupby('id').resample('W', how={'last_seen': 'max', 'val': 'sum'})
               val                  last_seen
id                                           
1  2015-01-04   56        2014-04-01 00:00:00
   2015-01-11   44        2014-03-09 00:00:00
   2015-01-18   43        2014-11-02 00:00:00
   2015-01-25  135        2014-12-19 00:00:00
   2015-02-01  230        2014-11-12 00:00:00
2  2015-01-04    3        2014-11-30 00:00:00
   2015-01-11  139        2014-12-09 00:00:00
   2015-01-18  118        2014-07-02 00:00:00
3  2015-01-04   48        2014-08-26 00:00:00
   2015-01-11   77        2014-08-03 00:00:00
   2015-01-18  173        2014-12-19 00:00:00
   2015-01-25  158        2014-11-25 00:00:00
4  2015-01-04   17  2014-01-13 00:00:00+00:00
   2015-01-11  147  2014-02-07 00:00:00+00:00
   2015-01-18   86  2014-04-03 00:00:00+00:00
   2015-01-25  NaN                        NaN
   2015-02-01   70  2014-07-27 00:00:00+00:00

As you can see, some of the resampled last_seen values are still timezone-aware, but most are not.

We can look at the actual records to see whether they offer an explanation, but I can't see one:

>>> df[df['id']==3].sort_index()
            id  val                  last_seen
2015-01-03   3   48  2014-08-26 00:00:00+00:00
2015-01-06   3   77  2014-08-03 00:00:00+00:00
2015-01-12   3   84  2014-02-10 00:00:00+00:00
2015-01-16   3   89  2014-12-19 00:00:00+00:00
2015-01-19   3   91  2014-08-30 00:00:00+00:00
2015-01-20   3   49  2014-04-17 00:00:00+00:00
2015-01-21   3    4  2014-03-08 00:00:00+00:00
2015-01-24   3   14  2014-11-25 00:00:00+00:00
>>> df[df['id']==4].sort_index()
            id  val                  last_seen
2015-01-01   4   17  2014-01-13 00:00:00+00:00
2015-01-08   4   54  2014-02-07 00:00:00+00:00
2015-01-09   4   93  2014-01-20 00:00:00+00:00
2015-01-18   4   86  2014-04-03 00:00:00+00:00
2015-01-29   4   70  2014-07-27 00:00:00+00:00
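To make this kind of inspection less dependent on reading printed output, the tzinfo of each value can be checked programmatically. The sketch below uses a hypothetical stand-in column (`mixed`) that imitates the mixed result shown earlier, and a hypothetical helper `is_tz_aware`; neither name comes from the original post.

```python
import pandas as pd


def is_tz_aware(value):
    """Return True if value is datetime-like and carries tzinfo."""
    return getattr(value, 'tzinfo', None) is not None


# Hypothetical stand-in for a result column holding a naive timestamp,
# a tz-aware timestamp, and a missing value.
mixed = pd.Series(
    [
        pd.Timestamp('2014-04-01'),
        pd.Timestamp('2014-01-13', tz='UTC'),
        float('nan'),
    ],
    dtype=object,
)

# An object dtype means pandas is storing arbitrary Python objects here,
# not a single datetime dtype with one timezone.
print(mixed.dtype)

aware_mask = mixed.map(is_tz_aware)
print(aware_mask.tolist())  # [False, True, False]
```

Running the same check on `df['last_seen']` before resampling and on the resampled column afterwards would show at which step the tzinfo disappears.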

I also tried using a lambda to add the tzinfo back, but it still doesn't work, so I guess the tzinfo is lost at a later stage of the process:

df.groupby('id').resample('W', how={'last_seen': lambda x: max(x).replace(tzinfo=pytz.UTC), 'val': 'sum'})
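Since re-adding tzinfo inside the aggregation function does not survive, one alternative is to re-attach it after the aggregation has finished. The following is a minimal sketch under several assumptions: a small deterministic frame replaces the random one, `pd.Grouper(freq='W')` stands in for `groupby(...).resample(...)`, and all values are in UTC. It is not a confirmed fix for the pandas version used in the question.

```python
from datetime import datetime, timezone

import pandas as pd

# Small deterministic stand-in for the question's random frame (assumption).
idx = pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-08', '2015-01-09'])
df = pd.DataFrame(
    {
        'id': [1, 1, 1, 2],
        'val': [10, 20, 30, 40],
        'last_seen': [
            datetime(2014, 3, 1, tzinfo=timezone.utc),
            datetime(2014, 5, 1, tzinfo=timezone.utc),
            datetime(2014, 2, 1, tzinfo=timezone.utc),
            datetime(2014, 7, 1, tzinfo=timezone.utc),
        ],
    },
    index=idx,
)

# Aggregate on naive UTC values so the tz cannot be partially dropped.
work = df.copy()
work['last_seen'] = pd.to_datetime(work['last_seen'], utc=True).dt.tz_localize(None)

# Group by id and by weekly bins of the DatetimeIndex, one function per column.
out = work.groupby(['id', pd.Grouper(freq='W')]).agg(
    {'val': 'sum', 'last_seen': 'max'}
)

# Re-attach UTC once, on the whole result column.
out['last_seen'] = out['last_seen'].dt.tz_localize('UTC')

print(out)
```

Unlike `resample`, grouping with `pd.Grouper` only produces rows for (id, week) pairs that actually contain data, so the all-NaN rows seen in the output above would not appear here.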

0 answers:

No answers