I have a pandas.DataFrame named data with the following structure:
                       id         action
date
1900-11-01 00:00:00  10.0    starts_game
1900-11-01 00:05:00  10.0  team_a_scores
1900-11-01 00:25:00  10.0  team_a_scores
1900-11-01 00:30:00  10.0  team_a_scores
1900-11-01 00:55:00  10.0  team_b_scores
1900-11-01 23:58:00  99.0    starts_game
1900-11-02 00:40:00  99.0  team_b_scores
1900-11-02 00:50:00  99.0  team_b_scores
1900-11-03 00:05:00  10.0    starts_game
1900-11-03 00:24:00  10.0  team_b_scores
I want to resample it to one row per minute, using a different upsampling strategy for each column. In the id column I want to forward-fill, while in the action column I want to fill the upsampled values with just the string 'playing'.
The problem is that forward-filling the resampled DataFrame directly and doing the same through agg give different results. Let's take a look:
data.resample('T').ffill().head()
                       id       action
date
1900-11-01 00:00:00  10.0  starts_game
1900-11-01 00:01:00  10.0  starts_game
1900-11-01 00:02:00  10.0  starts_game
1900-11-01 00:03:00  10.0  starts_game
1900-11-01 00:04:00  10.0  starts_game
But remember, I want the action column to contain only the string 'playing', so:
data.resample('T').agg(dict(id='ffill', action=lambda _: 'playing')).head()
                       id   action
date
1900-11-01 00:00:00  10.0  playing
1900-11-01 00:01:00   NaN  playing
1900-11-01 00:02:00   NaN  playing
1900-11-01 00:03:00   NaN  playing
1900-11-01 00:04:00   NaN  playing
I don't understand why id is not upsampled correctly here. Any idea?
For reproducibility, here is the CSV:
date,id,action
1900-11-01 00:00:00,10.0,starts_game
1900-11-01 00:05:00,10.0,team_a_scores
1900-11-01 00:25:00,10.0,team_a_scores
1900-11-01 00:30:00,10.0,team_a_scores
1900-11-01 00:55:00,10.0,team_b_scores
1900-11-01 23:58:00,99.0,starts_game
1900-11-02 00:40:00,99.0,team_b_scores
1900-11-02 00:50:00,99.0,team_b_scores
1900-11-03 00:05:00,10.0,starts_game
1900-11-03 00:24:00,10.0,team_b_scores
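The code that built data is not shown here. A minimal sketch of one way to load the CSV above, assuming the date column is meant to become a parsed DatetimeIndex (the index_col and parse_dates arguments are assumptions, not necessarily what the asker ran):

```python
import io

import pandas as pd

# The CSV from the question, inlined so the example is self-contained.
CSV_TEXT = """date,id,action
1900-11-01 00:00:00,10.0,starts_game
1900-11-01 00:05:00,10.0,team_a_scores
1900-11-01 00:25:00,10.0,team_a_scores
1900-11-01 00:30:00,10.0,team_a_scores
1900-11-01 00:55:00,10.0,team_b_scores
1900-11-01 23:58:00,99.0,starts_game
1900-11-02 00:40:00,99.0,team_b_scores
1900-11-02 00:50:00,99.0,team_b_scores
1900-11-03 00:05:00,10.0,starts_game
1900-11-03 00:24:00,10.0,team_b_scores
"""

# Assumption: "date" is used as the index and parsed as datetimes,
# which resample() and asfreq() both require.
data = pd.read_csv(io.StringIO(CSV_TEXT), index_col="date", parse_dates=True)

print(data.head())
```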
Answer 0 (score: 1)
The reason agg does not work is that resample('T') returns a groupby-like object in which each minute is its own group:
>>> data.resample('T').groups
{Timestamp('1900-11-01 00:00:00', freq='T'): 1,
Timestamp('1900-11-01 00:01:00', freq='T'): 1,
Timestamp('1900-11-01 00:02:00', freq='T'): 1,
Timestamp('1900-11-01 00:03:00', freq='T'): 1,
Timestamp('1900-11-01 00:04:00', freq='T'): 1, ...
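agg is applied to each group separately. Here every group holds at most one row (the minutes created by upsampling are empty), so the lambda happily returns a scalar for each of them, while ffill only sees the single element available inside its own group, if any, and has nothing from earlier minutes to propagate.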
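To check how many rows actually land in each one-minute bin, the group sizes can be inspected. A minimal sketch, assuming a two-row frame shaped like data; it uses the 'min' alias rather than 'T', which newer pandas versions deprecate:

```python
import pandas as pd

# Two rows five minutes apart, mimicking the start of the question's data.
idx = pd.to_datetime(["1900-11-01 00:00:00", "1900-11-01 00:05:00"])
data = pd.DataFrame(
    {"id": [10.0, 10.0], "action": ["starts_game", "team_a_scores"]},
    index=idx,
)
data.index.name = "date"

# Number of original rows that fall into each one-minute bin.
sizes = data.resample("min").size()
print(sizes)
```

The bins for 00:01 through 00:04 contain no rows at all, so a per-group ffill has no value to carry into them.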
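Had you resampled by, for example, one day
>>> data.resample('D').groups
{Timestamp('1900-11-01 00:00:00', freq='D'): 6,
Timestamp('1900-11-02 00:00:00', freq='D'): 8,
Timestamp('1900-11-03 00:00:00', freq='D'): 10}
it would have been the opposite. (The integers here look like cumulative row positions, so the three days hold 6, 2 and 2 rows.) Your lambda would return only a single value for all 6 elements of the first group, while the 'ffill' method would work as expected and propagate the last non-NaN value forward within the group: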
>>> data.resample('D').agg({'id': 'ffill', 'action': lambda _: 'playing'})
                       id   action
date
1900-11-01 00:00:00  10.0  playing
1900-11-01 00:05:00  10.0      NaN
1900-11-01 00:25:00  10.0      NaN
1900-11-01 00:30:00  10.0      NaN
1900-11-01 00:55:00  10.0      NaN
1900-11-01 23:58:00  99.0      NaN
1900-11-02 00:00:00   NaN  playing
1900-11-02 00:40:00  99.0      NaN
1900-11-02 00:50:00  99.0      NaN
1900-11-03 00:00:00   NaN  playing
1900-11-03 00:05:00  10.0      NaN
1900-11-03 00:24:00  10.0      NaN
I'm not sure the whole thing can be done in one go, but the following should work:
df = data.resample('T').first()
df['id'] = df['id'].ffill()
df['action'] = df['action'].fillna('playing')
which gives you
                       id         action
date
1900-11-01 00:00:00  10.0    starts_game
1900-11-01 00:01:00  10.0        playing
1900-11-01 00:02:00  10.0        playing
1900-11-01 00:03:00  10.0        playing
1900-11-01 00:04:00  10.0        playing
1900-11-01 00:05:00  10.0  team_a_scores
1900-11-01 00:06:00  10.0        playing
1900-11-01 00:07:00  10.0        playing
Update
You can use asfreq instead of resample. It returns a plain DataFrame and behaves the way you expect:
>>> data.asfreq('T').agg({'id': 'ffill', 'action': lambda _: 'playing'})
                       id   action
date
1900-11-01 00:00:00  10.0  playing
1900-11-01 00:01:00  10.0  playing
1900-11-01 00:02:00  10.0  playing
1900-11-01 00:03:00  10.0  playing
1900-11-01 00:04:00  10.0  playing
The solution above then becomes
df = data.asfreq('T')
df['id'] = df['id'].ffill()
df['action'] = df['action'].fillna('playing')
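If a single expression is preferred, the same two fills can probably be chained with DataFrame.assign. This is a sketch under the assumption that every timestamp is unique and sits exactly on a minute boundary, since asfreq reindexes onto a regular grid and would not keep off-grid rows. The three-row frame is made up for illustration, and 'min' stands in for 'T':

```python
import pandas as pd

# Small illustrative frame (not the full data from the question).
idx = pd.to_datetime(
    ["1900-11-01 00:00:00", "1900-11-01 00:03:00", "1900-11-01 00:05:00"]
)
data = pd.DataFrame(
    {
        "id": [10.0, 10.0, 10.0],
        "action": ["starts_game", "team_a_scores", "team_a_scores"],
    },
    index=idx,
)
data.index.name = "date"

# Upsample to one row per minute, then fill each column with its own strategy.
df = data.asfreq("min").assign(
    id=lambda d: d["id"].ffill(),                    # carry the last id forward
    action=lambda d: d["action"].fillna("playing"),  # mark the inserted rows
)

print(df)
```

The original action values are kept on the rows that already existed, and only the rows inserted by upsampling become 'playing'.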