How to count the number of identical instances in a dataframe column within a rolling window

Time: 2017-09-11 20:13:37

Tags: python pandas numpy machine-learning data-mining

I am trying to count, within each sliding window, how many times each ID occurs in this data:

DATE                    ID      s_2_count    s_3_count   s_5_count       
2017-05-17 15:49:51     s_2         2            0         1 
2017-05-17 15:49:52     s_5         1            1         1   
2017-05-17 15:49:55     s_2         1            1         1   
2017-05-17 15:49:56     s_3         0            1         2   
2017-05-17 15:49:58     s_5         NaN          NaN       NaN
2017-05-17 15:49:59     s_5         NaN          NaN       NaN

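The rolling windows have size 3 and overlap one another. The result should look like the s_2_count, s_3_count and s_5_count columns above: each row counts the IDs in that row and the two rows after it.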

1 answer:

Answer 0 (score: 2)

Use str.get_dummies, rolling, sum, shift, and add_suffix:

df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count')

Output:

                     s_2_count  s_3_count  s_5_count
DATE                                                
2017-05-17 15:49:51        2.0        0.0        1.0
2017-05-17 15:49:52        1.0        1.0        1.0
2017-05-17 15:49:55        1.0        1.0        1.0
2017-05-17 15:49:56        0.0        1.0        2.0
2017-05-17 15:49:58        NaN        NaN        NaN
2017-05-17 15:49:59        NaN        NaN        NaN
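
To make the one-liner easier to follow, here is a minimal sketch that runs the chain one step at a time. The dataframe is rebuilt from the question's sample with DATE as the index; the intermediate names (dummies, trailing, forward, counts) are only illustrative and do not appear in the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question, with DATE as the index.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.DatetimeIndex(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ],
        name="DATE",
    ),
)

# Step 1: one 0/1 indicator column per distinct ID.
dummies = df["ID"].str.get_dummies()

# Step 2: trailing window of 3 rows (current row plus the two before it).
trailing = dummies.rolling(3).sum()

# Step 3: move every result up two rows, so each row now holds the count
# for itself and the two rows after it; the last two rows become NaN.
forward = trailing.shift(-2)

# Step 4: rename s_2 -> s_2_count, and so on.
counts = forward.add_suffix("_count")
print(counts)
```

The shift(-2) is what turns pandas' default backward-looking window into the forward-looking one the question asks for; for a window of size n the shift would be -(n - 1).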

Let's assign it back to the dataframe:

df.assign(**df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count'))

Or use join:

df.join(df.ID.str.get_dummies().rolling(3).sum().shift(-2).add_suffix('_count'))

Output:

                      ID  s_2_count  s_3_count  s_5_count
DATE                                                     
2017-05-17 15:49:51  s_2        2.0        0.0        1.0
2017-05-17 15:49:52  s_5        1.0        1.0        1.0
2017-05-17 15:49:55  s_2        1.0        1.0        1.0
2017-05-17 15:49:56  s_3        0.0        1.0        2.0
2017-05-17 15:49:58  s_5        NaN        NaN        NaN
2017-05-17 15:49:59  s_5        NaN        NaN        NaN
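
As a possible alternative to rolling(3) followed by shift(-2), the forward-looking window can be requested directly. This is only a sketch and not part of the original answer: it assumes a reasonably recent pandas release that provides pandas.api.indexers.FixedForwardWindowIndexer, and the names indexer and forward_counts are illustrative.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Sample data rebuilt from the question, with DATE as the index.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.DatetimeIndex(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ],
        name="DATE",
    ),
)

dummies = df["ID"].str.get_dummies()

# Each window covers the current row and the next two rows.
indexer = FixedForwardWindowIndexer(window_size=3)

# min_periods=3 is set explicitly so that incomplete windows at the end
# of the frame yield NaN, matching the shift(-2) result.
forward_counts = (
    dummies.rolling(indexer, min_periods=3).sum().add_suffix("_count")
)

print(df.join(forward_counts))
```

Setting min_periods=1 instead would give partial counts for the last two rows rather than NaN, which may or may not be what you want.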

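Option 2: use pd.crosstab. Note that the new columns are then named after the bare ID values (s_2, s_3, s_5); chain add_suffix('_count') to get the same names as above.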

df.assign(**pd.crosstab(df.index,df.ID).rolling(3).sum().shift(-2))

Or use join:

df.join(pd.crosstab(df.index,df.ID).rolling(3).sum().shift(-2))
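
For reference, a self-contained sketch of the crosstab variant on the question's sample data, with add_suffix('_count') chained on so the column names match the first option. The names table, counts, and result are illustrative. It assumes the DATE index is unique and already sorted, because pd.crosstab groups equal row labels together and returns them in sorted order.

```python
import pandas as pd

# Sample data rebuilt from the question, with DATE as the index.
df = pd.DataFrame(
    {"ID": ["s_2", "s_5", "s_2", "s_3", "s_5", "s_5"]},
    index=pd.DatetimeIndex(
        [
            "2017-05-17 15:49:51",
            "2017-05-17 15:49:52",
            "2017-05-17 15:49:55",
            "2017-05-17 15:49:56",
            "2017-05-17 15:49:58",
            "2017-05-17 15:49:59",
        ],
        name="DATE",
    ),
)

# One row per timestamp, one column per ID, cells hold occurrence counts.
table = pd.crosstab(df.index, df["ID"])

# Same rolling/shift logic as option 1, plus the column suffix.
counts = table.rolling(3).sum().shift(-2).add_suffix("_count")

# Attach the counts to the original frame by index.
result = df.join(counts)
print(result)
```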