Resampling with agg behaves differently from calling the function directly

Time: 2018-12-13 23:19:23

Tags: pandas

I have a pandas.DataFrame named data with the following structure:

                       id         action
date                                    
1900-11-01 00:00:00  10.0    starts_game
1900-11-01 00:05:00  10.0  team_a_scores
1900-11-01 00:25:00  10.0  team_a_scores
1900-11-01 00:30:00  10.0  team_a_scores
1900-11-01 00:55:00  10.0  team_b_scores
1900-11-01 23:58:00  99.0    starts_game
1900-11-02 00:40:00  99.0  team_b_scores
1900-11-02 00:50:00  99.0  team_b_scores
1900-11-03 00:05:00  10.0    starts_game
1900-11-03 00:24:00  10.0  team_b_scores

I want to resample to one-minute frequency, using a different upsampling strategy per column. In the id column I want to forward-fill, while in the action column I just want to fill the upsampled values with 'playing'.

The problem is that I get different results when I forward-fill the resampled DataFrame directly and when I go through the agg function. Let's take a look:

data.resample('T').ffill().head()

                       id       action
date                                  
1900-11-01 00:00:00  10.0  starts_game
1900-11-01 00:01:00  10.0  starts_game
1900-11-01 00:02:00  10.0  starts_game
1900-11-01 00:03:00  10.0  starts_game
1900-11-01 00:04:00  10.0  starts_game

But remember, I want the action column to be just the string 'playing', so:

data.resample('T').agg(dict(id='ffill', action=lambda _: 'playing')).head()

                       id   action
date                              
1900-11-01 00:00:00  10.0  playing
1900-11-01 00:01:00   NaN  playing
1900-11-01 00:02:00   NaN  playing
1900-11-01 00:03:00   NaN  playing
1900-11-01 00:04:00   NaN  playing

I don't understand why id is not upsampled properly. Any idea?

For reproducibility, here is the CSV:

date,id,action
1900-11-01 00:00:00,10.0,starts_game
1900-11-01 00:05:00,10.0,team_a_scores
1900-11-01 00:25:00,10.0,team_a_scores
1900-11-01 00:30:00,10.0,team_a_scores
1900-11-01 00:55:00,10.0,team_b_scores
1900-11-01 23:58:00,99.0,starts_game
1900-11-02 00:40:00,99.0,team_b_scores
1900-11-02 00:50:00,99.0,team_b_scores
1900-11-03 00:05:00,10.0,starts_game
1900-11-03 00:24:00,10.0,team_b_scores

And the code:
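
The loading snippet itself does not appear in this copy of the question. A minimal sketch of one way to load the CSV, assuming it is read with pandas.read_csv and that the date column becomes a datetime index (the name CSV_TEXT and the in-memory string in place of a file are assumptions made for illustration):

```python
import io

import pandas as pd

# The CSV from the question, inlined so the example is self-contained.
CSV_TEXT = """date,id,action
1900-11-01 00:00:00,10.0,starts_game
1900-11-01 00:05:00,10.0,team_a_scores
1900-11-01 00:25:00,10.0,team_a_scores
1900-11-01 00:30:00,10.0,team_a_scores
1900-11-01 00:55:00,10.0,team_b_scores
1900-11-01 23:58:00,99.0,starts_game
1900-11-02 00:40:00,99.0,team_b_scores
1900-11-02 00:50:00,99.0,team_b_scores
1900-11-03 00:05:00,10.0,starts_game
1900-11-03 00:24:00,10.0,team_b_scores
"""

# Assumption: `date` is parsed as datetimes and used as the index,
# which is what resample() and asfreq() require.
data = pd.read_csv(io.StringIO(CSV_TEXT), index_col="date", parse_dates=True)

print(data.head())
```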

1 answer:

Answer 0 (score: 1)

agg does not work here because resample('T') returns a groupby-like object whose groups are the rows that fall into each one-minute bin:

>>> data.resample('T').groups
{Timestamp('1900-11-01 00:00:00', freq='T'): 1,
 Timestamp('1900-11-01 00:01:00', freq='T'): 1,
 Timestamp('1900-11-01 00:02:00', freq='T'): 1,
 Timestamp('1900-11-01 00:03:00', freq='T'): 1,
 Timestamp('1900-11-01 00:04:00', freq='T'): 1, ...

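agg is applied to each group separately, and here a group is at most a single row. The lambda therefore happily returns a scalar for every bin, while ffill only ever sees the one element available in its group and has nothing to carry into the empty minutes.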
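
To see how sparse the one-minute bins are, a small sketch using Resampler.size() can help. It uses only the first three rows of the question's data as a stand-in, and the alias 'min' instead of 'T', since 'T' is deprecated in recent pandas releases. The variable names are chosen for illustration only.

```python
import pandas as pd

# Shortened stand-in for the question's data: the first three rows only.
data = pd.DataFrame(
    {
        "id": [10.0, 10.0, 10.0],
        "action": ["starts_game", "team_a_scores", "team_a_scores"],
    },
    index=pd.to_datetime(
        ["1900-11-01 00:00:00", "1900-11-01 00:05:00", "1900-11-01 00:25:00"]
    ),
)
data.index.name = "date"

# Number of original rows that land in each one-minute bin.
sizes = data.resample("min").size()

print(sizes.head(7))
print("bins:", len(sizes), "empty bins:", int((sizes == 0).sum()))
```

Every bin holds either one row or none, so a per-group ffill never has a neighbouring value to propagate.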

Had you resampled by, say, a day instead:

>>> data.resample('D').groups
{Timestamp('1900-11-01 00:00:00', freq='D'): 6,
 Timestamp('1900-11-02 00:00:00', freq='D'): 8,
 Timestamp('1900-11-03 00:00:00', freq='D'): 10}

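the situation would have been reversed. Your lambda would return only a single value for all 6 elements of the first group, but the 'ffill' method would work as expected, propagating the most recent non-NaN value forward within the group: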

>>> data.resample('D').agg({'id': 'ffill', 'action': lambda _: 'playing'})
                       id   action
date                              
1900-11-01 00:00:00  10.0  playing
1900-11-01 00:05:00  10.0      NaN
1900-11-01 00:25:00  10.0      NaN
1900-11-01 00:30:00  10.0      NaN
1900-11-01 00:55:00  10.0      NaN
1900-11-01 23:58:00  99.0      NaN
1900-11-02 00:00:00   NaN  playing
1900-11-02 00:40:00  99.0      NaN
1900-11-02 00:50:00  99.0      NaN
1900-11-03 00:00:00   NaN  playing
1900-11-03 00:05:00  10.0      NaN
1900-11-03 00:24:00  10.0      NaN
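
The values 6, 8 and 10 reported by .groups for the daily bins look like running end positions rather than per-day row counts. That reading is an assumption about how .groups reports bins, but it is consistent with the per-bin counts from Resampler.size(), as this sketch using only the question's timestamps suggests:

```python
import pandas as pd

# The ten timestamps from the question; the values themselves do not matter here.
stamps = pd.to_datetime(
    [
        "1900-11-01 00:00:00", "1900-11-01 00:05:00", "1900-11-01 00:25:00",
        "1900-11-01 00:30:00", "1900-11-01 00:55:00", "1900-11-01 23:58:00",
        "1900-11-02 00:40:00", "1900-11-02 00:50:00",
        "1900-11-03 00:05:00", "1900-11-03 00:24:00",
    ]
)
rows = pd.Series(1, index=stamps)

per_day = rows.resample("D").size()  # rows per daily bin
running_end = per_day.cumsum()       # cumulative row count at the end of each bin

print(per_day.tolist())      # [6, 2, 2]
print(running_end.tolist())  # [6, 8, 10]
```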

I'm not sure whether the whole thing can be done in one go, but the following should work:

df = data.resample('T').first()
df['id'] = df['id'].ffill()
df['action'] = df['action'].fillna('playing')

which gives you

                       id         action
date                                    
1900-11-01 00:00:00  10.0    starts_game
1900-11-01 00:01:00  10.0        playing
1900-11-01 00:02:00  10.0        playing
1900-11-01 00:03:00  10.0        playing
1900-11-01 00:04:00  10.0        playing
1900-11-01 00:05:00  10.0  team_a_scores
1900-11-01 00:06:00  10.0        playing
1900-11-01 00:07:00  10.0        playing

Update

You can use asfreq instead of resample. It returns a plain DataFrame and behaves the way you expect:

>>> data.asfreq('T').agg({'id': 'ffill', 'action': lambda _: 'playing'})
                       id   action
date                              
1900-11-01 00:00:00  10.0  playing
1900-11-01 00:01:00  10.0  playing
1900-11-01 00:02:00  10.0  playing
1900-11-01 00:03:00  10.0  playing
1900-11-01 00:04:00  10.0  playing

which changes the solution above to

df = data.asfreq('T')
df['id'] = df['id'].ffill()
df['action'] = df['action'].fillna('playing')
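
As a quick check, the sketch below runs both variants on a shortened stand-in for the question's data (the first three rows only) and compares them. The alias 'min' replaces 'T', which is deprecated in recent pandas releases. The third, chained form is an addition for illustration and not part of the original answer. One caveat, stated as general pandas behaviour rather than something the answer covers: asfreq reindexes onto the regular grid, so it suits data whose timestamps are unique and already lie on whole minutes, as they do here.

```python
import pandas as pd

# Shortened stand-in for the question's data: the first three rows only.
data = pd.DataFrame(
    {
        "id": [10.0, 10.0, 10.0],
        "action": ["starts_game", "team_a_scores", "team_a_scores"],
    },
    index=pd.to_datetime(
        ["1900-11-01 00:00:00", "1900-11-01 00:05:00", "1900-11-01 00:25:00"]
    ),
)
data.index.name = "date"

# Variant 1: resample, keep the first row of each bin, then fill per column.
by_first = data.resample("min").first()
by_first["id"] = by_first["id"].ffill()
by_first["action"] = by_first["action"].fillna("playing")

# Variant 2: asfreq returns a plain DataFrame on the one-minute grid.
by_asfreq = data.asfreq("min")
by_asfreq["id"] = by_asfreq["id"].ffill()
by_asfreq["action"] = by_asfreq["action"].fillna("playing")

# Variant 3 (illustrative): the same steps chained into one expression.
# 'action' is filled first, so the later ffill only affects 'id'.
chained = data.asfreq("min").fillna({"action": "playing"}).ffill()

print(by_asfreq.head(7))
```

All three variants keep the original action values on the rows that existed before upsampling and put 'playing' only on the newly created minutes.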