My simplified data looks like this:
set.seed(1453); x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20',
'2016-01-20', '2016-01-25', '2016-01-26', '2016-01-31')
df = data.frame(x, date = as.Date(date))
df
x date
1 2016-01-01
0 2016-01-05
1 2016-01-07
0 2016-01-12
0 2016-01-16
1 2016-01-20
1 2016-01-20
0 2016-01-25
0 2016-01-26
1 2016-01-31
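I want to count the occurrences of x == 1 within specified time periods, e.g. 14 and 30 days from the current date (but excluding the current entry, if it is x == 1). The desired output would look like this: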
solution
x date x_plus14 x_plus30
1 2016-01-01 1 3
0 2016-01-05 1 4
1 2016-01-07 2 3
0 2016-01-12 2 3
0 2016-01-16 2 3
1 2016-01-20 2 2
1 2016-01-20 1 1
0 2016-01-25 1 1
0 2016-01-26 1 1
1 2016-01-31 0 0
Ideally, I'd like to do this in dplyr, but that is not a must. Any ideas how to achieve this? Thanks a lot for your help!
Answer 0 (score: 5)
Adding another approach, based on findInterval:
cs = cumsum(df$x) # cumulative number of occurrences
data.frame(df,
           plus14 = cs[findInterval(df$date + 14, df$date, left.open = TRUE)] - cs,
           plus30 = cs[findInterval(df$date + 30, df$date, left.open = TRUE)] - cs)
# x date plus14 plus30
#1 1 2016-01-01 1 3
#2 0 2016-01-05 1 4
#3 1 2016-01-07 2 3
#4 0 2016-01-12 2 3
#5 0 2016-01-16 2 3
#6 1 2016-01-20 2 2
#7 1 2016-01-20 1 1
#8 0 2016-01-25 1 1
#9 0 2016-01-26 1 1
#10 1 2016-01-31 0 0
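To make the mechanics of this approach easier to follow, here is a minimal step-by-step sketch of the 14-day column. It assumes the rows are sorted by date (findInterval needs a sorted vector) and hard-codes x from the printed table, because sample() output for a given seed can differ between R versions. The names d and last_row are illustrative only and do not appear in the answer.

```r
# Data hard-coded from the question's printed table (assumption, see above).
x <- c(1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L)
date <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                  "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                  "2016-01-26", "2016-01-31"))
d <- as.numeric(date)   # days since 1970-01-01; must be sorted ascending

cs <- cumsum(x)         # running count of x == 1 up to and including each row

# With left.open = TRUE, each cutoff d + 14 maps to the number of dates
# strictly before it, i.e. the index of the last row inside the window.
last_row <- findInterval(d + 14, d, left.open = TRUE)

# Ones up to the last row in the window, minus ones up to the current row.
plus14 <- cs[last_row] - cs

data.frame(date, x, cs, last_row, plus14)
```

Because the subtraction works on row positions rather than on dates, the two 2016-01-20 rows get different counts (2 and 1), which is what the question's expected table shows.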
Answer 1 (score: 4)
Earlier I wasn't including the current date, which is why the numbers didn't match.
library(data.table)
setDT(df)[, `:=`(x14 = sum(df$x[between(df$date, date, date + 14, incbounds = FALSE)]),
                 x30 = sum(df$x[between(df$date, date, date + 30, incbounds = FALSE)])),
          by = date]
# x date x14 x30
# 1: 1 2016-01-01 1 3
# 2: 0 2016-01-05 1 4
# 3: 1 2016-01-07 2 3
# 4: 0 2016-01-12 2 3
# 5: 0 2016-01-16 2 3
# 6: 1 2016-01-20 1 1
# 7: 1 2016-01-20 1 1
# 8: 0 2016-01-25 1 1
# 9: 0 2016-01-26 1 1
# 10: 1 2016-01-31 0 0
Or a generic solution that works for any desired ranges:
vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) :=
            lapply(vec, function(i) sum(df$x[between(df$date,
                                                     date,
                                                     date + i,
                                                     incbounds = FALSE)])),
          by = date]
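Note that the output above gives 1 for both 2016-01-20 rows, whereas the question's expected table has 2 and 1 there. The sketch below is one way to reproduce the expected table exactly. It assumes the intended rule is "count later rows, in row order, whose date is less than the window length ahead", and it assumes the rows are sorted by date. The helper count_later is a hypothetical name, and x is hard-coded from the printed data because sample() output for a given seed can differ between R versions.

```r
# Data hard-coded from the question's printed table (assumption, see above).
x <- c(1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L)
date <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                  "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                  "2016-01-26", "2016-01-31"))

# Hypothetical helper: count x == 1 in rows after row i that fall
# strictly within `window` days of row i's date.
count_later <- function(i, window) {
  later     <- seq_along(x) > i            # rows after row i
  in_window <- date < date[i] + window     # upper bound is exclusive
  sum(x[later & in_window])
}

rows <- seq_along(x)
x_plus14 <- vapply(rows, count_later, integer(1), window = 14)
x_plus30 <- vapply(rows, count_later, integer(1), window = 30)

data.frame(x, date, x_plus14, x_plus30)
```

This loops over all rows for every row, so it is quadratic in the number of rows; for the ten rows here that does not matter.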
Answer 2 (score: 2)
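Here's my stab at it with some help from dplyr + purrr. My counts are slightly different because of the comparison operators (> and <=) in the helper function x_next(); if you adjust them appropriately, I think you should be able to get the result you want. Hope this helps.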
library("tidyverse")
library("lubridate")
set.seed(1453)
x = sample(0:1, 10, TRUE)
dates = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20',
'2016-01-20', '2016-01-25', '2016-01-26', '2016-01-31')
df = tibble(x = x, dates = lubridate::as_date(dates)) # tibble() replaces the deprecated data_frame()
# helper function to calculate the sum of xs in the next days_in_future
x_next <- function(d, days_in_future) {
  df %>%
    # subset on days of interest
    filter(dates > d & dates <= d + days(days_in_future)) %>%
    # sum up xs
    summarise(sum = sum(x)) %>%
    # unlist the one-row result so that each call returns a single number
    unlist(use.names = FALSE)
}
# mutate your df; map_dbl() returns a numeric column instead of a list column
df %>%
  mutate(xplus14 = map_dbl(dates, x_next, 14),
         xplus30 = map_dbl(dates, x_next, 30))
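The boundary effect mentioned in this answer can be seen on the first row, where 2016-01-31 lies exactly 30 days after 2016-01-01. The base R sketch below is only an illustration: count_window and its inclusive argument are hypothetical names, and x is hard-coded from the question's printed data because sample() output for a given seed can differ between R versions.

```r
# Data hard-coded from the question's printed table (assumption, see above).
x <- c(1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L)
date <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                  "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                  "2016-01-26", "2016-01-31"))

# Hypothetical helper: sum x over dates after d, up to d + window,
# with the upper bound either included (<=) or excluded (<).
count_window <- function(d, window, inclusive = TRUE) {
  upper      <- d + window
  before_end <- if (inclusive) date <= upper else date < upper
  sum(x[date > d & before_end])
}

count_window(date[1], 30, inclusive = TRUE)   # 4: 2016-01-31 is counted
count_window(date[1], 30, inclusive = FALSE)  # 3: matches the question's table
```

Switching the upper comparison from <= to < is therefore enough to change the first row's 30-day count from 4 to 3.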
Answer 3 (score: 2)
A concise dplyr and purrr solution:
library(tidyverse)
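# `sample` is the question's data frame (columns x and date); the question's code calls it df
sample %>%
  mutate(x_plus14 = map_int(date, ~sum(x == 1 & between(date, . + 1, . + 14))),
         x_plus30 = map_int(date, ~sum(x == 1 & between(date, . + 1, . + 30))))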
x date x_plus14 x_plus30
1 1 2016-01-01 1 4
2 0 2016-01-05 1 4
3 1 2016-01-07 2 3
4 0 2016-01-12 2 3
5 0 2016-01-16 2 3
6 1 2016-01-20 1 1
7 1 2016-01-20 1 1
8 0 2016-01-25 1 1
9 0 2016-01-26 1 1
10 1 2016-01-31 0 0
Answer 4 (score: 1)
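As others have already mentioned, it is odd that you don't count the current date itself, and you should avoid naming objects after functions (sample). However, the code below reproduces your desired output: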
set.seed(1453)
x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07', '2016-01-12', '2016-01-16', '2016-01-20',
         '2016-01-20', '2016-01-25', '2016-01-26', '2016-01-31')
# use the date vector defined above; sample$date does not exist yet at this point
sample = data.frame(x = x, date = as.Date(date))
getOccurences <- function(one_row, sample_data, date_range){
  # apply() passes each row as a character vector, so convert the date back
  one_date <- as.Date(one_row[2])
  # index the data frame passed in, not the global object
  sum(sample_data$x[sample_data$date > one_date &
                    sample_data$date < one_date + date_range])
}
sample$x_plus14 <- apply(sample, 1, getOccurences, sample, 14)
sample$x_plus30 <- apply(sample, 1, getOccurences, sample, 30)
sample
x date x_plus14 x_plus30
1 1 2016-01-01 1 3
2 0 2016-01-05 1 4
3 1 2016-01-07 2 3
4 0 2016-01-12 2 3
5 0 2016-01-16 2 3
6 1 2016-01-20 1 1
7 1 2016-01-20 1 1
8 0 2016-01-25 1 1
9 0 2016-01-26 1 1
10 1 2016-01-31 0 0
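Since apply() converts the data frame to a character matrix before calling the function, a variant that iterates over row indices keeps the dates as Date objects throughout. The following is a minimal sketch of that idea, not the answer's own code: the names dat and get_occurrences are hypothetical, and x is hard-coded from the printed data because sample() output for a given seed can differ between R versions.

```r
# Data hard-coded from the printed table above (assumption, see lead-in).
dat <- data.frame(
  x = c(1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L),
  date = as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                   "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                   "2016-01-26", "2016-01-31"))
)

# Hypothetical helper: sum x over dates strictly between one_date and
# one_date + date_range (both bounds exclusive, as in the answer).
get_occurrences <- function(one_date, data, date_range) {
  sum(data$x[data$date > one_date & data$date < one_date + date_range])
}

rows <- seq_len(nrow(dat))
# dat$date[i] keeps its Date class, so no character round trip is needed.
dat$x_plus14 <- vapply(rows, function(i) get_occurrences(dat$date[i], dat, 14),
                       integer(1))
dat$x_plus30 <- vapply(rows, function(i) get_occurrences(dat$date[i], dat, 30),
                       integer(1))
dat
```

Like the answer's version, this compares dates only, so both 2016-01-20 rows get a count of 1.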