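I have a DataFrame that I want to resample. One of its columns contains timezone-aware datetimes, and I want the resampled DataFrame to hold the maximum datetime within each sample period. I expect the values in the resampled DataFrame to remain timezone-aware.
For a time series, this works: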
import pandas as pd
import pytz
from datetime import datetime, timedelta
from random import randint
rng = pd.date_range('20150101',periods=30,freq='D')
ts=pd.Series([datetime(2014,1,1, tzinfo=pytz.UTC) + timedelta(days=randint(1,365)) for item in rng], index=rng)
>>> ts.resample('W', how='max')
2015-01-04 2014-10-09 00:00:00+00:00
2015-01-11 2014-12-14 00:00:00+00:00
2015-01-18 2014-12-18 00:00:00+00:00
2015-01-25 2014-10-18 00:00:00+00:00
2015-02-01 2014-11-21 00:00:00+00:00
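For comparison, here is a minimal sketch of the same weekly maximum written in the method-chaining form (`resample('W').max()`) that newer pandas releases use in place of `how=`. It assumes a recent pandas version; the seeded NumPy generator, `pd.Timestamp(..., tz='UTC')` (standing in for pytz), and the names `ts_new` and `weekly_max` are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('20150101', periods=30, freq='D')
gen = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Random offsets of 1..365 days, added to a tz-aware base timestamp
offsets = pd.to_timedelta(gen.integers(1, 366, size=len(rng)), unit='D')
ts_new = pd.Series(pd.Timestamp('2014-01-01', tz='UTC') + offsets, index=rng)

# Weekly maximum; the tz-aware dtype should carry through
weekly_max = ts_new.resample('W').max()
print(weekly_max.dt.tz)
```

Checking `weekly_max.dt.tz` is a quick way to confirm whether the timezone survived the resample.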
However, if I do the same thing with a DataFrame, I run into a problem:
df=pd.DataFrame([(datetime(2014,1,1, tzinfo=pytz.UTC)+timedelta(days=randint(1,365))) for item in rng], index=rng)
df.columns=['last_seen']
>>> df.head()
last_seen
2015-01-01 2014-11-15 00:00:00+00:00
2015-01-02 2014-11-13 00:00:00+00:00
2015-01-03 2014-03-06 00:00:00+00:00
2015-01-04 2014-11-29 00:00:00+00:00
2015-01-05 2014-11-12 00:00:00+00:00
>>> df.resample('W', how='max')
last_seen
2015-01-04 2014-11-29
2015-01-11 2014-12-29
2015-01-18 2014-12-13
2015-01-25 2014-11-03
2015-02-01 2014-12-21
>>> df.resample('W', how={'last_seen': 'max'})
last_seen
2015-01-04 2014-11-29
2015-01-11 2014-12-29
2015-01-18 2014-12-13
2015-01-25 2014-11-03
2015-02-01 2014-12-21
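One thing that may be worth trying is to normalise the column to pandas' native tz-aware dtype before resampling and, if the result still comes back naive, to re-attach UTC afterwards. The sketch below assumes a recent pandas version and that every value really is in UTC; `df_check`, `weekly`, the fixed day offsets, and `timezone.utc` (standing in for `pytz.UTC`) are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

rng = pd.date_range('20150101', periods=4, freq='D')
# Plain Python tz-aware datetimes, similar to the question's data
raw = [datetime(2014, 1, 1, tzinfo=timezone.utc) + timedelta(days=d)
       for d in (10, 200, 35, 90)]
df_check = pd.DataFrame({'last_seen': raw}, index=rng)

# Convert to a datetime64 column with an explicit UTC timezone
df_check['last_seen'] = pd.to_datetime(df_check['last_seen'], utc=True)

weekly = df_check.resample('W').max()

# Fallback: if the timezone was dropped, localize the naive values as UTC
if weekly['last_seen'].dt.tz is None:
    weekly['last_seen'] = weekly['last_seen'].dt.tz_localize('UTC')
```

The `tz_localize` fallback is only valid when the naive values are known to represent UTC; it relabels them rather than converting them.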
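This is actually a simplified example of what I am trying to do, which is to group by one column while resampling two other columns with different functions: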
df=pd.DataFrame([(randint(1,4), randint(1,100), datetime(2014,1,1, tzinfo=pytz.UTC)+timedelta(days=randint(1,365))) for item in rng], index=rng)
df.columns=('id', 'val', 'last_seen')
>>> df.groupby('id').resample('W', how={'last_seen': 'max', 'val': 'sum'})
                val                  last_seen
id
1  2015-01-04    56        2014-04-01 00:00:00
   2015-01-11    44        2014-03-09 00:00:00
   2015-01-18    43        2014-11-02 00:00:00
   2015-01-25   135        2014-12-19 00:00:00
   2015-02-01   230        2014-11-12 00:00:00
2  2015-01-04     3        2014-11-30 00:00:00
   2015-01-11   139        2014-12-09 00:00:00
   2015-01-18   118        2014-07-02 00:00:00
3  2015-01-04    48        2014-08-26 00:00:00
   2015-01-11    77        2014-08-03 00:00:00
   2015-01-18   173        2014-12-19 00:00:00
   2015-01-25   158        2014-11-25 00:00:00
4  2015-01-04    17  2014-01-13 00:00:00+00:00
   2015-01-11   147  2014-02-07 00:00:00+00:00
   2015-01-18    86  2014-04-03 00:00:00+00:00
   2015-01-25   NaN                        NaN
   2015-02-01    70  2014-07-27 00:00:00+00:00
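As you can see, some of the resampled last_seen values are still timezone-aware, but most are not.
We can look at the actual records to see whether they offer an explanation, but I can't see anything: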
>>> df[df['id']==3].sort_index()
id val last_seen
2015-01-03 3 48 2014-08-26 00:00:00+00:00
2015-01-06 3 77 2014-08-03 00:00:00+00:00
2015-01-12 3 84 2014-02-10 00:00:00+00:00
2015-01-16 3 89 2014-12-19 00:00:00+00:00
2015-01-19 3 91 2014-08-30 00:00:00+00:00
2015-01-20 3 49 2014-04-17 00:00:00+00:00
2015-01-21 3 4 2014-03-08 00:00:00+00:00
2015-01-24 3 14 2014-11-25 00:00:00+00:00
>>> df[df['id']==4].sort_index()
id val last_seen
2015-01-01 4 17 2014-01-13 00:00:00+00:00
2015-01-08 4 54 2014-02-07 00:00:00+00:00
2015-01-09 4 93 2014-01-20 00:00:00+00:00
2015-01-18 4 86 2014-04-03 00:00:00+00:00
2015-01-29 4 70 2014-07-27 00:00:00+00:00
I also tried using a lambda to add the tzinfo back, but that still doesn't work, so I guess the tzinfo is lost at a later stage of the process:
df.groupby('id').resample('W', how={'last_seen': lambda x: max(x).replace(tzinfo=pytz.UTC), 'val': 'sum'})
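A possible alternative, sketched under the assumption of a recent pandas version, is to group on both the id column and a weekly `pd.Grouper`, then aggregate each column with its own function. The seeded data, `pd.Timestamp(..., tz='UTC')` in place of pytz, and the names `df_new` and `grouped` are illustrative assumptions. Unlike `resample`, this form does not emit rows for weeks in which a given id has no records.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('20150101', periods=30, freq='D')
gen = np.random.default_rng(0)  # seeded only so the sketch is reproducible

df_new = pd.DataFrame({
    'id': gen.integers(1, 5, size=len(rng)),      # ids 1..4
    'val': gen.integers(1, 101, size=len(rng)),   # values 1..100
    'last_seen': pd.Timestamp('2014-01-01', tz='UTC')
                 + pd.to_timedelta(gen.integers(1, 366, size=len(rng)), unit='D'),
}, index=rng)

# Group by id and by calendar week of the index, one aggregation per column
grouped = (
    df_new.groupby(['id', pd.Grouper(freq='W')])
          .agg({'last_seen': 'max', 'val': 'sum'})
)
print(grouped['last_seen'].dt.tz)
```

If the empty weeks are needed, the result could be reindexed afterwards, at the cost of introducing missing values in both columns.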