I have a table like this (emails are reduced to a single letter here):
timestamp | email
2018-10-17 13:00:00+00:00 | m
2018-10-17 13:00:00+00:00 | m
2018-10-17 13:00:10+00:00 |
2018-10-17 13:00:10+00:00 | v
2018-10-17 13:00:30+00:00 |
2018-10-17 13:00:30+00:00 | c
2018-10-17 13:00:50+00:00 | p
2018-10-17 13:01:00+00:00 |
2018-10-17 13:01:00+00:00 | m
2018-10-17 13:01:00+00:00 | s
2018-10-17 13:01:00+00:00 | b
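Now I want to create a new column that counts, for example, how many times the same email was repeated in the 30 seconds leading up to each entry.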
timestamp | email | count | comment
2018-10-17 13:00:00+00:00 | m | 1 |
2018-10-17 13:00:00+00:00 | m | 2 | (there were 2 entries in the last 30s)
2018-10-17 13:00:10+00:00 | | 1 | (empty values are counted as well)
2018-10-17 13:00:10+00:00 | v | 1 |
2018-10-17 13:00:30+00:00 | | 2 | (empty values are counted like emails)
2018-10-17 13:00:30+00:00 | c | 1 |
2018-10-17 13:00:50+00:00 | p | 1 |
2018-10-17 13:01:00+00:00 | | 2 | (in the last 30s from this ts, we have 2)
2018-10-17 13:01:00+00:00 | m | 1 | (the first 2 m happened before the last 30s)
2018-10-17 13:01:00+00:00 | s | 1 |
2018-10-17 13:01:00+00:00 | b | 1 |
The timestamp is a datetime object:
timestamp datetime64[ns, UTC]
It is also the index, and it is sorted. I first tried the following command:
df['email'].groupby(df.email).rolling('120s').count().values
But that does not work on strings, so I converted the emails to unique numbers with:
full_df['email'].factorize()
But the result does not look right:
timestamp | email | count | comment
2018-10-17 13:00:00+00:00 | m | 1 |
2018-10-17 13:00:00+00:00 | m | 2 |
2018-10-17 13:00:10+00:00 | | 1 |
2018-10-17 13:00:10+00:00 | v | 2 | (no idea about this result)
2018-10-17 13:00:30+00:00 | | 3 | (appears to just keep counting)
2018-10-17 13:00:30+00:00 | c | 1 | (then it just goes back to 1 again...)
2018-10-17 13:00:50+00:00 | p | 2 |
2018-10-17 13:01:00+00:00 | | 3 |
2018-10-17 13:01:00+00:00 | m | 4 |
2018-10-17 13:01:00+00:00 | s | 1 |
2018-10-17 13:01:00+00:00 | b | 1 |
Any idea what I am doing wrong, and how I can get the result I want?
Many thanks, Joao
Answer 0 (score: 1)
You can use apply after rolling to count how many times the last element of each window appears in that window.
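A minimal sketch of that idea, rebuilt from the sample data in the question rather than taken from the answer itself. It assumes the missing emails are empty strings, that the emails are first encoded as integers with factorize (since rolling only handles numeric data), and that the window is closed on both ends (closed="both"), which appears to be what the expected table needs for the row at 13:01:00 to count the entry from exactly 30 seconds earlier.

```python
import pandas as pd

# Sample data from the question; empty strings stand in for missing emails (assumption).
timestamps = pd.to_datetime(
    [
        "2018-10-17 13:00:00+00:00", "2018-10-17 13:00:00+00:00",
        "2018-10-17 13:00:10+00:00", "2018-10-17 13:00:10+00:00",
        "2018-10-17 13:00:30+00:00", "2018-10-17 13:00:30+00:00",
        "2018-10-17 13:00:50+00:00",
        "2018-10-17 13:01:00+00:00", "2018-10-17 13:01:00+00:00",
        "2018-10-17 13:01:00+00:00", "2018-10-17 13:01:00+00:00",
    ],
    utc=True,
)
emails = ["m", "m", "", "v", "", "c", "p", "", "m", "s", "b"]
df = pd.DataFrame({"email": emails}, index=pd.DatetimeIndex(timestamps, name="timestamp"))

# rolling() needs numeric input, so map every distinct email to an integer code.
codes = pd.Series(pd.factorize(df["email"])[0], index=df.index)

# Each window ends at the current row, so w[-1] is the current email's code;
# count how many values in the window equal it.
# closed="both" also includes rows exactly 30 seconds old (assumption, see above).
counts = codes.rolling("30s", closed="both").apply(
    lambda w: (w == w[-1]).sum(), raw=True
)
df["count"] = counts.astype(int).to_numpy()

print(df)
```

With the default closed="right", a row exactly 30 seconds older than the current one falls outside the window, so pick whichever boundary matches the intended meaning of "the last 30 seconds". Note that apply with a Python lambda can be slow on large frames.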