Question

My simplified data looks like this:

set.seed(1453); x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07',  '2016-01-12',  '2016-01-16',  '2016-01-20',
             '2016-01-20',  '2016-01-25',  '2016-01-26',  '2016-01-31')


df = data.frame(x, date = as.Date(date))


df 
x       date
1 2016-01-01
0 2016-01-05
1 2016-01-07
0 2016-01-12
0 2016-01-16
1 2016-01-20
1 2016-01-20
0 2016-01-25
0 2016-01-26
1 2016-01-31

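I want to count the occurrences of x == 1 within a specified time period, e.g. 14 and 30 days from the current date (but not including the current entry, if it is x == 1). The desired output would look like this: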

solution
x       date x_plus14 x_plus30
1 2016-01-01        1        3
0 2016-01-05        1        4
1 2016-01-07        2        3
0 2016-01-12        2        3
0 2016-01-16        2        3
1 2016-01-20        2        2
1 2016-01-20        1        1
0 2016-01-25        1        1
0 2016-01-26        1        1
1 2016-01-31        0        0
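
The desired output appears to follow this rule, which is inferred from the table rather than stated in the question: for each row, count the rows below it with x == 1 whose date is less than n days after the current date. Rows further down with the same date count too, which is why the first 2016-01-20 row gets 2. A minimal base R sketch of that rule follows; x is hard-coded from the table above because sample() can return different values depending on the R version, and count_ahead is a hypothetical helper name.

```r
# x is copied from the printed table rather than regenerated with sample().
x <- c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1)
date <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                  "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                  "2016-01-26", "2016-01-31"))
df <- data.frame(x, date)

# count_ahead() is a hypothetical helper: for row i, sum x over the rows
# below it whose date is less than n days after date[i].
# Assumes df is sorted by date.
count_ahead <- function(df, n) {
  idx <- seq_len(nrow(df))
  vapply(idx, function(i) {
    later <- idx > i & df$date < df$date[i] + n
    sum(df$x[later])
  }, numeric(1))
}

df$x_plus14 <- count_ahead(df, 14)
df$x_plus30 <- count_ahead(df, 30)
df
```

This loop is quadratic in the number of rows, so it is meant as a readable reference for checking faster solutions, not as something to run on large data.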

Ideally I would like to do this in dplyr, but that is not a must. Any ideas on how to achieve this? Thanks a lot for your help!

Answer 1

Adding another approach, based on findInterval:

cs = cumsum(df$x) # cumulative number of occurrences
data.frame(df, 
           plus14 = cs[findInterval(df$date + 14, df$date, left.open = TRUE)] - cs, 
           plus30 = cs[findInterval(df$date + 30, df$date, left.open = TRUE)] - cs)
#   x       date plus14 plus30
#1  1 2016-01-01      1      3
#2  0 2016-01-05      1      4
#3  1 2016-01-07      2      3
#4  0 2016-01-12      2      3
#5  0 2016-01-16      2      3
#6  1 2016-01-20      2      2
#7  1 2016-01-20      1      1
#8  0 2016-01-25      1      1
#9  0 2016-01-26      1      1
#10 1 2016-01-31      0      0
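
To see why this works: with left.open = TRUE, findInterval(date + 14, date) gives, for each row, the number of dates strictly before date + 14, which is the index of the last row inside the window. Subtracting the running total at the current row from the running total at that index leaves the number of x == 1 in the rows in between. The sketch below spells out the intermediate vectors. It assumes the data are sorted by date and uses x as printed in the question; last14 is a name introduced here for illustration.

```r
x <- c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1)
date <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                  "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                  "2016-01-26", "2016-01-31"))

# Running count of x == 1 up to and including each row
cs <- cumsum(x)

# Index of the last row dated strictly before date + 14
last14 <- findInterval(date + 14, date, left.open = TRUE)

# Events after the current row and inside the window
plus14 <- cs[last14] - cs

data.frame(x, date, cs, last14, plus14)
```

Because the subtraction works on row positions, the second 2016-01-20 row is counted for the first one, which matches the desired output.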

Answer 2

Earlier I did not include the current date, which is why the numbers did not match.

library(data.table)
setDT(df)[, `:=`(x14 = sum(df$x[between(df$date, date, date + 14, incbounds = FALSE)]), 
                 x30 = sum(df$x[between(df$date, date, date + 30, incbounds = FALSE)])),
              by = date]

#     x       date x14 x30
#  1: 1 2016-01-01   1   3
#  2: 0 2016-01-05   1   4
#  3: 1 2016-01-07   2   3
#  4: 0 2016-01-12   2   3
#  5: 0 2016-01-16   2   3
#  6: 1 2016-01-20   1   1
#  7: 1 2016-01-20   1   1
#  8: 0 2016-01-25   1   1
#  9: 0 2016-01-26   1   1
# 10: 1 2016-01-31   0   0

Or a general solution that works for any desired ranges:

vec <- c(14, 30) # Specify desired ranges
setDT(df)[, paste0("x", vec) := 
            lapply(vec, function(i) sum(df$x[between(df$date, 
                                                     date, 
                                                     date + i, 
                                                     incbounds = FALSE)])),
            by = date]

Answer 3

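I took a stab at this with some help from dplyr and purrr. My counts are slightly different because of the comparison operators (> and <=) in the helper function x_next(); if you adjust them appropriately, I think you should be able to get the result you want. HTH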

library("tidyverse")
library("lubridate")
set.seed(1453)

x = sample(0:1, 10, TRUE)
dates = c('2016-01-01', '2016-01-05', '2016-01-07',  '2016-01-12',  '2016-01-16',  '2016-01-20',
         '2016-01-20',  '2016-01-25',  '2016-01-26',  '2016-01-31')

df = tibble(x = x, dates = lubridate::as_date(dates))  # tibble() replaces the deprecated data_frame()

# helper function to calculate the sum of xs in the next days_in_future
x_next <- function(d, days_in_future) {

  df %>% 
    # subset on days of interest
    filter(dates > d & dates <= d + days(days_in_future)) %>% 
    # sum up xs
    summarise(sum = sum(x)) %>% 
    # have to unlist them so that the (following) call to mutate works
    unlist(use.names = FALSE)
  }

# mutate your df
df %>% 
  # map_dbl() returns plain numeric columns instead of list columns
  mutate(xplus14 = map_dbl(dates, x_next, 14),
         xplus30 = map_dbl(dates, x_next, 30))
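
The difference comes down to whether the upper bound of the window is inclusive. As a sketch in base R (count_window and its include_upper argument are hypothetical names, and x is copied from the question's table), the same date-based count can be run both ways:

```r
x <- c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1)
date <- as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                  "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                  "2016-01-26", "2016-01-31"))

# count_window() is a hypothetical helper: sum x over dates after d and
# up to d + n, with the upper bound either included or excluded.
count_window <- function(d, n, include_upper = TRUE) {
  upper <- if (include_upper) date <= d + n else date < d + n
  sum(x[date > d & upper])
}

# First row, 30-day window: 2016-01-31 is exactly 30 days after 2016-01-01
count_window(date[1], 30, include_upper = TRUE)   # 4, the inclusive count
count_window(date[1], 30, include_upper = FALSE)  # 3, as in the desired output
```

Either way, a purely date-based filter gives both 2016-01-20 rows the same count of 1, whereas the desired output has 2 for the first of them.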

Answer 4

A concise dplyr and purrr solution:

library(tidyverse)

# df is the data frame from the question; map_int() returns integer columns
df %>% 
  mutate(x_plus14 = map_int(date, ~sum(x == 1 & between(date, . + 1, . + 14))),
         x_plus30 = map_int(date, ~sum(x == 1 & between(date, . + 1, . + 30))))

   x       date x_plus14 x_plus30
1  1 2016-01-01        1        4
2  0 2016-01-05        1        4
3  1 2016-01-07        2        3
4  0 2016-01-12        2        3
5  0 2016-01-16        2        3
6  1 2016-01-20        1        1
7  1 2016-01-20        1        1
8  0 2016-01-25        1        1
9  0 2016-01-26        1        1
10 1 2016-01-31        0        0

Answer 5

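As others have already mentioned, it is odd that you do not count the current date, and you should avoid naming objects after functions (sample). That said, the code below reproduces your desired output: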

set.seed(1453); 
x = sample(0:1, 10, TRUE)
date = c('2016-01-01', '2016-01-05', '2016-01-07',  '2016-01-12',  '2016-01-16',  '2016-01-20',
             '2016-01-20',  '2016-01-25',  '2016-01-26',  '2016-01-31')


sample = data.frame(x = x, date = as.Date(date))

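getOccurences <- function(one_row, sample_data, date_range){
  # apply() passes each row as a character vector, so convert the date back
  one_date <- as.Date(one_row[2])
  # sum x over dates strictly after one_date and before one_date + date_range
  sum(sample_data$x[sample_data$date > one_date & 
                    sample_data$date < one_date + date_range])
}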

sample$x_plus14 <- apply(sample,1,getOccurences, sample, 14)
sample$x_plus30 <- apply(sample,1,getOccurences, sample, 30)

sample

   x       date x_plus14 x_plus30
1  1 2016-01-01        1        3
2  0 2016-01-05        1        4
3  1 2016-01-07        2        3
4  0 2016-01-12        2        3
5  0 2016-01-16        2        3
6  1 2016-01-20        1        1
7  1 2016-01-20        1        1
8  0 2016-01-25        1        1
9  0 2016-01-26        1        1
10 1 2016-01-31        0        0
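
Note that apply() first converts the data frame to a character matrix, which is why the helper has to call as.Date() on one_row[2]. Below is a sketch of the same date-based count using vapply() over the row indices, which avoids that round trip. Here x is copied from the question's table, events is used as the data frame name instead of sample, and get_occurrences is a hypothetical helper name.

```r
events <- data.frame(
  x = c(1, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  date = as.Date(c("2016-01-01", "2016-01-05", "2016-01-07", "2016-01-12",
                   "2016-01-16", "2016-01-20", "2016-01-20", "2016-01-25",
                   "2016-01-26", "2016-01-31"))
)

# get_occurrences() is a hypothetical helper: for each row, sum x over
# dates strictly after that row's date and before date + date_range.
get_occurrences <- function(data, date_range) {
  vapply(seq_len(nrow(data)), function(i) {
    d <- data$date[i]
    sum(data$x[data$date > d & data$date < d + date_range])
  }, numeric(1))
}

events$x_plus14 <- get_occurrences(events, 14)
events$x_plus30 <- get_occurrences(events, 30)
events
```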

R: Counting occurrences of a specific event within a specified future time period
