Question

I have a large dataset with unique IDs for individuals and dates, and each individual can have multiple encounters.

Here is the code, along with an example of what this data looks like:

strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16", "11/8/16",
"6/8/16", "5/8/16","2/3/16","1/1/16")
Date<-as.Date(strDates, "%m/%d/%y")
ID <- c("A", "A", "A", "A","A","B","B","B","B","B")
Event <- c(1,0,1,0,1,0,1,1,1,0)
sample_df <- data.frame(Date,ID,Event)

sample_df

         Date ID Event
1  2016-09-09  A     1
2  2016-06-07  A     0
3  2016-05-06  A     1
4  2016-02-03  A     0
5  2016-02-01  A     1
6  2016-11-08  B     0
7  2016-06-08  B     1
8  2016-05-08  B     1
9  2016-02-03  B     1
10 2016-01-01  B     0

I want to keep all of the additional information for each encounter, but then aggregate the following historical information by ID:

Number of previous encounters
Number of previous events

As an example, let's look at row 2.

Row 2 is ID A, so I would reference rows 3-5 (which occurred before the row 2 encounter). Within this group of rows, we see that rows 3 and 5 both had events.

Number of previous encounters for row 2 = 3

Number of previous events for row 2 = 2

Ideally, I would get output that keeps every row and adds these two counts as new columns.

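So far I have tried to solve this with mutate and summarise in dplyr, but neither let me restrict the aggregation to events that occurred earlier for a specific ID. I have also tried some messy for loops with if-then statements, but really I am just wondering whether a package or technique exists to simplify this process.

Thanks!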

Answer 1

Alternatively, if you want to try data.table, you can use:

library(data.table)

# Convert to data.table and sort
sample_dt <- as.data.table(sample_df)
sample_dt <- sample_dt[order(Date)]

# Count only the previous Events with 1
sample_dt[, prevEvent := ifelse(Event == 1, cumsum(Event) - 1, cumsum(Event)), by = "ID"]

# .I gives the row number, and .SD contains the Subset of the Data for each group
sample_dt[, prevEnc := .SD[,.I - 1], by = "ID"]

print(sample_dt)
          Date ID Event prevEvent prevEnc
 1: 2016-01-01  B     0         0       0
 2: 2016-02-01  A     1         0       0
 3: 2016-02-03  A     0         1       1
 4: 2016-02-03  B     1         0       1
 5: 2016-05-06  A     1         1       2
 6: 2016-05-08  B     1         1       2
 7: 2016-06-07  A     0         2       3
 8: 2016-06-08  B     1         2       3
 9: 2016-09-09  A     1         2       4
10: 2016-11-08  B     0         3       4

If you do not know the package, there is a nice cheat sheet covering most operations.
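For comparison, the same two counts can be sketched in base R with ave(). This is only an illustrative equivalent, not part of the answer above; the names sorted, prevEnc and prevEvent are chosen here for the example. It makes explicit that the ifelse() expression above amounts to cumsum(Event) - Event within each ID.

```r
# Sample data from the question
strDates <- c("09/09/16", "6/7/16", "5/6/16", "2/3/16", "2/1/16", "11/8/16",
              "6/8/16", "5/8/16", "2/3/16", "1/1/16")
sample_df <- data.frame(
  Date  = as.Date(strDates, "%m/%d/%y"),
  ID    = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  Event = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

# Sort oldest-first within each ID so cumulative counts run forward in time
sorted <- sample_df[order(sample_df$ID, sample_df$Date), ]

# Previous encounters: position within the ID group, counted from 0
sorted$prevEnc <- ave(sorted$Event, sorted$ID,
                      FUN = function(x) seq_along(x) - 1)

# Previous events: running total of Event minus the current row's own Event
sorted$prevEvent <- ave(sorted$Event, sorted$ID,
                        FUN = function(x) cumsum(x) - x)

print(sorted)
```

For the 2016-06-07 encounter of ID A (row 2 in the question), this yields 3 previous encounters and 2 previous events, matching the worked example in the question.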

Answer 2

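The biggest hurdle is the current sort order. Here, I store an original index, which I later use to restore the original row order (and then drop). Beyond that, the basic idea is to count encounters starting from 0 and to use cumsum to count the events that have occurred so far. lag is used so that the current event is not counted.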

library(dplyr)

sample_df %>%
  mutate(origIndex = 1:n()) %>%
  group_by(ID) %>%
  arrange(ID, Date) %>%
  mutate(PrevEncounters = 0:(n() -1)
         , PrevEvents = cumsum(lag(Event, default = 0))) %>%
  arrange(origIndex) %>%
  select(-origIndex)

which gives:

         Date     ID Event PrevEncounters PrevEvents
       <date> <fctr> <dbl>          <int>      <dbl>
1  2016-09-09      A     1              4          2
2  2016-06-07      A     0              3          2
3  2016-05-06      A     1              2          1
4  2016-02-03      A     0              1          1
5  2016-02-01      A     1              0          0
6  2016-11-08      B     0              4          3
7  2016-06-08      B     1              3          2
8  2016-05-08      B     1              2          1
9  2016-02-03      B     1              1          0
10 2016-01-01      B     0              0          0
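To see why lag followed by cumsum excludes the current row, it may help to trace the steps on the Event values of ID A alone. The sketch below uses base R only; c(0, head(x, -1)) is used here as a stand-in for dplyr's lag(x, default = 0), and the variable names are chosen for illustration.

```r
# Events for ID A, oldest encounter first (as after arrange(ID, Date))
event <- c(1, 0, 1, 0, 1)

# Shift right by one and pad with 0, so each row sees only earlier events
lagged <- c(0, head(event, -1))          # 0 1 0 1 0

# Running total of earlier events
prev_events <- cumsum(lagged)            # 0 1 1 2 2

# Encounters counted from 0, i.e. 0:(n - 1)
prev_encounters <- seq_along(event) - 1  # 0 1 2 3 4

print(data.frame(event, lagged, prev_events, prev_encounters))
```

Read from the bottom up, these values are the PrevEvents and PrevEncounters shown for ID A in the output above.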

Answer 3

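As @Frank and @MarkPeterson pointed out, the biggest hurdle here is that the Date column is sorted in descending order. Here is another approach that does not need to use the Date column: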

library(dplyr)
res <- sample_df %>% group_by(ID) %>% 
                     mutate(PrevEnc=n()-row_number(),
                            PrevEvent=rev(cumsum(lag(rev(Event), default=0))))

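Here, we use row_number() to get the row index and n() to get the number of rows (both within each ID group). Because Date is sorted in descending order, the number of previous encounters is simply n()-row_number(). To compute the number of previous events, we again rely on the Date column being sorted in descending order: we use rev to reverse the Event column, take the lag of the reversed column, and then apply cumsum. Finally, we use rev again to put the result back in the original order.

Using your data: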

print(res)
##Source: local data frame [10 x 5]
##Groups: ID [2]
##
##         Date     ID Event PrevEnc PrevEvent
##       <date> <fctr> <dbl>   <int>     <dbl>
##1  2016-09-09      A     1       4         2
##2  2016-06-07      A     0       3         2
##3  2016-05-06      A     1       2         1
##4  2016-02-03      A     0       1         1
##5  2016-02-01      A     1       0         0
##6  2016-11-08      B     0       4         3
##7  2016-06-08      B     1       3         2
##8  2016-05-08      B     1       2         1
##9  2016-02-03      B     1       1         0
##10 2016-01-01      B     0       0         0
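The rev / lag / cumsum chain from this answer can be traced step by step on a small vector. The values below are hypothetical (not taken from the question) and were picked because they are not symmetric, so each reversal is visible. As before, c(0, head(x, -1)) is a base-R stand-in for dplyr's lag(x, default = 0).

```r
# Hypothetical Event values for one ID, newest encounter first
event <- c(1, 1, 0, 0)

oldest_first <- rev(event)                         # 0 0 1 1
lagged       <- c(0, head(oldest_first, -1))       # 0 0 0 1
running      <- cumsum(lagged)                     # 0 0 0 1
prev_event   <- rev(running)                       # 1 0 0 0

# Rows are newest-first, so earlier encounters = rows below the current one
prev_enc     <- length(event) - seq_along(event)   # 3 2 1 0

print(data.frame(event, prev_enc, prev_event))
```

Note that this shortcut is only valid while the rows of each ID really are in descending date order; if that is not guaranteed, sorting by Date first is the safer choice.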

R: Aggregating history by ID by date

3 answers: