Pandas: using a rolling count

Time: 2018-11-14 15:22:17

Tags: python pandas jupyter

I have a table like this (the emails are simplified to a single letter here):

timestamp                  | email
2018-10-17 13:00:00+00:00  | m
2018-10-17 13:00:00+00:00  | m
2018-10-17 13:00:10+00:00  | 
2018-10-17 13:00:10+00:00  | v
2018-10-17 13:00:30+00:00  |  
2018-10-17 13:00:30+00:00  | c
2018-10-17 13:00:50+00:00  | p
2018-10-17 13:01:00+00:00  |  
2018-10-17 13:01:00+00:00  | m
2018-10-17 13:01:00+00:00  | s
2018-10-17 13:01:00+00:00  | b

Now I want to create a new column that counts, for example, how many times the same email appeared in the 30 seconds leading up to each entry.

timestamp                  | email | count | comment
2018-10-17 13:00:00+00:00  | m     |   1   |
2018-10-17 13:00:00+00:00  | m     |   2   | (there were 2 entries in the last 30s)
2018-10-17 13:00:10+00:00  |       |   1   | (empty we count as well)
2018-10-17 13:00:10+00:00  | v     |   1   |
2018-10-17 13:00:30+00:00  |       |   2   | (counting the empty like emails)
2018-10-17 13:00:30+00:00  | c     |   1   | 
2018-10-17 13:00:50+00:00  | p     |   1   |
2018-10-17 13:01:00+00:00  |       |   2   | (in the last 30s from this ts, we have 2)
2018-10-17 13:01:00+00:00  | m     |   1   | (the first 2 m happened before the last 30s)
2018-10-17 13:01:00+00:00  | s     |   1   |
2018-10-17 13:01:00+00:00  | b     |   1   |

The timestamp is a datetime object:

timestamp          datetime64[ns, UTC]

It is also the index, and it is sorted. I first tried the following command:

df['email'].groupby(df.email).rolling('120s').count().values

But it does not work with strings, so I converted the emails to unique numbers with:

full_df['email'].factorize()

But the result does not seem right:

timestamp                  | email | count | comment
2018-10-17 13:00:00+00:00  | m     |   1   |  
2018-10-17 13:00:00+00:00  | m     |   2   | 
2018-10-17 13:00:10+00:00  |       |   1   | 
2018-10-17 13:00:10+00:00  | v     |   2   | (No idea about this result)
2018-10-17 13:00:30+00:00  |       |   3   | (Appears to just keep counting)
2018-10-17 13:00:30+00:00  | c     |   1   | (Then it just goes back to 1 again...)
2018-10-17 13:00:50+00:00  | p     |   2   |
2018-10-17 13:01:00+00:00  |       |   3   | 
2018-10-17 13:01:00+00:00  | m     |   4   | 
2018-10-17 13:01:00+00:00  | s     |   1   |
2018-10-17 13:01:00+00:00  | b     |   1   |

Any idea what I am doing wrong, and how I can get the result I want?

Thanks a lot, João

1 Answer:

Answer 0 (score: 1)

You can use apply after rolling to count how many times the last element of the window appears in that window.
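A minimal sketch of that idea, using the sample data from the question. It assumes the emails are first factorized into numeric codes (as the question already does), and that the window is closed on both ends so that an entry exactly 30 seconds old still counts, which is what the expected table in the question implies. The closed="both" setting and the column name count are assumptions made for this illustration, not something the original answer confirms.

```python
import pandas as pd

# Sample data from the question; "" stands for an empty email.
timestamps = pd.to_datetime(
    [
        "2018-10-17 13:00:00", "2018-10-17 13:00:00",
        "2018-10-17 13:00:10", "2018-10-17 13:00:10",
        "2018-10-17 13:00:30", "2018-10-17 13:00:30",
        "2018-10-17 13:00:50",
        "2018-10-17 13:01:00", "2018-10-17 13:01:00",
        "2018-10-17 13:01:00", "2018-10-17 13:01:00",
    ],
    utc=True,
)
emails = ["m", "m", "", "v", "", "c", "p", "", "m", "s", "b"]
df = pd.DataFrame({"email": emails}, index=timestamps.rename("timestamp"))

# rolling() only handles numeric data, so map each distinct email
# (including the empty one) to an integer code first.
codes = pd.Series(df["email"].factorize()[0], index=df.index, dtype="float64")

# For every row, look at the window ending at that row and count how many
# values in it equal the window's last value (the current row's email).
# closed="both" is an assumption: it keeps entries exactly 30 s old.
counts = codes.rolling("30s", closed="both").apply(
    lambda window: (window == window[-1]).sum(), raw=True
)

# Assign by position, because the index contains duplicate timestamps.
df["count"] = counts.to_numpy().astype(int)
print(df)
```

With the default closed="right", the entry at 13:00:30 would fall outside the window that ends at 13:01:00, so the empty email on that row would be counted once rather than twice. Rows that share a timestamp are processed in order, which is why the first m gets 1 and the second m gets 2.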