Question

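I asked a similar question earlier and got a lot of help: R: Aggregating History By ID By Date

The difference is that in the previous post I was interested in aggregating all of the history, whereas now I want to look back only 90 days.

Here is an example of what my data looks like: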

strDates <- c("09/09/16", "5/7/16", "5/6/16", "2/13/16", "2/11/16","1/7/16",
          "11/8/16","6/8/16", "5/8/16","2/13/16","1/3/16", "1/1/16")
Date<-as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A","A", "A", "B","B","B","B","B", "B")
Event <- c(1,0,1,0,1,1, 0,1,1,1,0, 1)
sample_df <- data.frame(Date,ID,Event)

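And the output:

Background

I want to keep all of the additional information attached to each encounter, but then aggregate the following history by ID, looking back 90 days: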

Number of previous encounters in the last 90 days
Number of previous events in the last 90 days

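Example

As an example, let's look at row 2.

Row 2 is ID A, so I would reference rows 3-6 (which occurred before the row 2 encounter). Within that set of rows, rows 3, 4, and 5 all occurred within the previous 90 days, while row 6 falls outside the window of interest.

Row 2, number of encounters in the previous 90 days: 3 encounters

Row 2, number of events in the previous 90 days: 2 events (2016-05-06 and 2016-02-11)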
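The row 2 numbers can be checked directly in base R. This is only a minimal sketch: it rebuilds the sample data from the question, and the names `cur`, `prior`, `gap`, and `in_window` are illustrative, not part of the original post. It also assumes that "previous 90 days" means strictly earlier than the current encounter and less than 90 days before it.

```r
# Rebuild the sample data from the question
strDates <- c("09/09/16", "5/7/16", "5/6/16", "2/13/16", "2/11/16", "1/7/16",
              "11/8/16", "6/8/16", "5/8/16", "2/13/16", "1/3/16", "1/1/16")
sample_df <- data.frame(
  Date  = as.Date(strDates, "%m/%d/%y"),
  ID    = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  Event = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1)
)

cur   <- sample_df[2, ]                        # the row 2 encounter (ID A, 2016-05-07)
prior <- sample_df[sample_df$ID == cur$ID, ]   # all encounters for the same ID
gap   <- as.numeric(cur$Date - prior$Date)     # days from each encounter to row 2

# Assumed window: strictly earlier than row 2 and less than 90 days back
in_window <- gap > 0 & gap < 90

prev_enc   <- sum(in_window)                   # 3 previous encounters
prev_event <- sum(prior$Event[in_window])      # 2 previous events
```

Rows 3, 4, and 5 fall 1, 84, and 86 days before row 2, while row 6 is 121 days back, which is why it is excluded.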

Desired output

Ideally, I would get the following output:

Answer 1

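Here is a very efficient alternative data.table solution. It uses the new non-equi joins introduced in v1.10.0, combined with by = .EACHI, which lets you run the calculations for each match while joining: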
library(data.table) #v1.10.0
setDT(sample_df)[, Date2 := Date - 90] # Set range (Maybe in future this could be avoided)
sample_df[sample_df, # Binary join with itself
          .(Enc90D = .N, Ev90D = sum(Event, na.rm = TRUE)), # Make calculations
          on = .(ID = ID, Date < Date, Date > Date2), # Join by
          by = .EACHI] # Do calculations per each match
#     ID       Date       Date Enc90D Ev90D
#  1:  A 2016-09-09 2016-06-11      0     0
#  2:  A 2016-05-07 2016-02-07      3     2
#  3:  A 2016-05-06 2016-02-06      2     1
#  4:  A 2016-02-13 2015-11-15      2     2
#  5:  A 2016-02-11 2015-11-13      1     1
#  6:  A 2016-01-07 2015-10-09      0     0
#  7:  B 2016-11-08 2016-08-10      0     0
#  8:  B 2016-06-08 2016-03-10      1     1
#  9:  B 2016-05-08 2016-02-08      1     1
# 10:  B 2016-02-13 2015-11-15      2     1
# 11:  B 2016-01-03 2015-10-05      1     1
# 12:  B 2016-01-01 2015-10-03      0     0
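To make the three join conditions concrete, here is a rough base R sketch that applies the same conditions row by row (same ID, earlier date, later than the date minus 90 days). It is not how data.table evaluates the join, only a brute-force way to see which rows each condition selects; the names `hit`, `window_counts`, and `result` are made up for this illustration.

```r
# Rebuild the sample data from the question
strDates <- c("09/09/16", "5/7/16", "5/6/16", "2/13/16", "2/11/16", "1/7/16",
              "11/8/16", "6/8/16", "5/8/16", "2/13/16", "1/3/16", "1/1/16")
sample_df <- data.frame(
  Date  = as.Date(strDates, "%m/%d/%y"),
  ID    = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  Event = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1)
)

# For each row i, flag the rows that satisfy all three join conditions
window_counts <- t(sapply(seq_len(nrow(sample_df)), function(i) {
  hit <- sample_df$ID   == sample_df$ID[i] &         # ID = ID
         sample_df$Date <  sample_df$Date[i] &       # Date < Date
         sample_df$Date >  sample_df$Date[i] - 90    # Date > Date2
  c(Enc90D = sum(hit), Ev90D = sum(sample_df$Event[hit]))
}))

result <- cbind(sample_df, window_counts)
```

This loop compares every row against every other row, so it only serves as a cross-check on small data such as the 12-row sample.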

Answer 2

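A partially vectorized dplyr solution, in which you combine do (to loop over the groups) with a rowwise operation (so that you can refer to Date as the date of each row, and to .$Date as the whole Date column within each group):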

sample_df %>% 
    group_by(ID) %>% 
    do(rowwise(.) %>% 
        mutate(PrevEnc90D = sum(Date - .$Date < 90 & Date - .$Date > 0), 
               PrevEvent90D = sum(.$Event[Date - .$Date < 90 & Date - .$Date > 0])))

#Source: local data frame [12 x 5]
#Groups: ID [2]

#         Date     ID Event PrevEnc90D PrevEvent90D
#       <date> <fctr> <dbl>      <int>        <dbl>
#1  2016-09-09      A     1          0            0
#2  2016-05-07      A     0          3            2
#3  2016-05-06      A     1          2            1
#4  2016-02-13      A     0          2            2
#5  2016-02-11      A     1          1            1
#6  2016-01-07      A     1          0            0
#7  2016-11-08      B     0          0            0
#8  2016-06-08      B     1          1            1
#9  2016-05-08      B     1          1            1
#10 2016-02-13      B     1          2            1
#11 2016-01-03      B     0          1            1
#12 2016-01-01      B     1          0            0

Answer 3

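A fairly verbose dplyr solution that uses more rows than are really needed. The idea is to create a fully joined table with one row per date, and then use window functions. This may be useful if different window calculations are needed.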

#Source: local data frame [12 x 6]
#Groups: ID [2]

library(dplyr)
library(zoo)  # provides rollsum()

dates <- data.frame(Date = seq(from = -90 + min(sample_df$Date), to = max(sample_df$Date), by=1)) 
extended_df <- data.frame(ID = unique(sample_df$ID)) %>%
  merge(dates) %>% 
  left_join(sample_df, by=(c("ID", "Date"))) %>% 
  arrange(ID, desc(Date)) %>%
  mutate(Encounter = as.integer(!is.na(Event)),
         Event = ifelse(is.na(Event), 0, Event)) %>%
  group_by(ID) %>%
  mutate(PrevEnc90D   = rollsum(lead(Encounter), k=90, fill=0, align="left"),
        PrevEvent90D  = rollsum(lead(Event),     k=90, fill=0, align="left")) %>%
  inner_join(sample_df[,c("ID", "Date")]) %>%
  arrange(ID, desc(Date))

extended_df

Answer 4

Another idea is to avoid repeated summations and relational operations as much as possible:

do.call(rbind, 
        lapply(split(sample_df, sample_df$ID), 
               function(x) {
                   i = nrow(x) - findInterval(x$Date - 90, rev(x$Date))
                   cs = cumsum(x$Event)
                   cbind(x, PrevEnc90D = i - (1:nrow(x)), PrevEvent90D = cs[i] - cs)
               }))
#           Date ID Event PrevEnc90D PrevEvent90D
#A.1  2016-09-09  A     1          0            0
#A.2  2016-05-07  A     0          3            2
#A.3  2016-05-06  A     1          2            1
#A.4  2016-02-13  A     0          2            2
#A.5  2016-02-11  A     1          1            1
#A.6  2016-01-07  A     1          0            0
#B.7  2016-11-08  B     0          0            0
#B.8  2016-06-08  B     1          1            1
#B.9  2016-05-08  B     1          1            1
#B.10 2016-02-13  B     1          2            1
#B.11 2016-01-03  B     0          1            1
#B.12 2016-01-01  B     1          0            0

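The above assumes that "Date" is sorted in decreasing order within each "ID". (If that is not the case, it is straightforward to handle.) The main idea here is to (i) find the 90-day lookback cutoff for each date, (ii) compute the cumulative sum of the events once, and (iii) subtract the respective indices / cumulative sums to get the output. I used the split / lapply route here to group by "ID", but I guess it is easily transferable to any tool.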
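The three steps can be traced on the ID A rows alone. This is a small sketch with the dates and events copied from the sample data; the variable names `d`, `ev`, and `n_old` are introduced here only for illustration, and the explicit as.numeric() calls are an extra precaution rather than part of the original answer.

```r
# Dates and events for ID A, in decreasing date order as the answer assumes
d  <- as.Date(c("2016-09-09", "2016-05-07", "2016-05-06",
                "2016-02-13", "2016-02-11", "2016-01-07"))
ev <- c(1, 0, 1, 0, 1, 1)

# (i) count the encounters dated on or before (date - 90); the remaining
#     ones end at index i, the oldest row still inside the window
n_old <- findInterval(as.numeric(d - 90), as.numeric(rev(d)))
i     <- length(d) - n_old

# (ii) cumulative sum of events, computed once for the whole group
cs <- cumsum(ev)

# (iii) differences of indices and of cumulative sums give both counts
prev_enc   <- i - seq_along(d)   # 0 3 2 2 1 0
prev_event <- cs[i] - cs         # 0 2 1 2 1 0
```

Because findInterval works on the sorted dates, each group needs one pass instead of one comparison per pair of rows.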

R: Aggregating history by ID and specified date
