Detect a sequence by group and compute a new variable for the subset

Time: 2019-04-11 18:30:50

Tags: r performance dataframe group-by data.table

I need to detect a sequence by group in a data.frame and compute a new variable.

Consider the following data.frame:

df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
              seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
              count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
              product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
              stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))

df1

> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D

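I am interested in computing a measure for IDs whose count follows this sequence, in consecutive rows:

  - count == 1
  - count > 1
  - count == 1

In the example, this applies to: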

 - rows 2, 3, 4 for `ID==1`
 - rows 8, 9, 10 for `ID==2`
 - rows 12, 13, 14 for `ID==3`
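
To make the pattern concrete, here is a minimal base R sketch of the detection step on the example columns. It assumes the rows are already sorted by ID and seqs, and the names `i`, `hit` and `starts` are only illustrative.

```r
# Only the columns needed for detection (copied from df1 above)
ID    <- c(1,1,1,1,1,1,1,2,2,2,3,3,3,3)
count <- c(2,1,3,1,1,2,3,1,2,1,3,1,4,1)

n <- length(count)
i <- seq_len(n - 2)   # candidate first rows of a three-row window

# A window starting at row i qualifies when its first and last rows share
# an ID (rows are sorted, so the middle row does too) and the counts
# follow the pattern == 1, > 1, == 1.
hit <- ID[i] == ID[i + 2] &
  count[i] == 1 & count[i + 1] > 1 & count[i + 2] == 1

starts <- i[hit]
starts
# [1]  2  8 12
```

The three start rows correspond to the windows 2-4, 8-10 and 12-14 listed above.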

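For these IDs and rows, I need to compute a measure called new. It takes the value of product in the last row of the sequence if that value also appears in product of the second row and does not appear in stock of the first row of the sequence.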
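
As a sketch of this rule for a single three-row sequence, the hypothetical helpers `items()` and `compute_new()` below compare comma-separated entries item by item. The item-wise handling of multi-product strings is an assumption on my part; the question does not spell it out.

```r
# Hypothetical helper: split a comma-separated string into its items
items <- function(x) strsplit(x, ",", fixed = TRUE)[[1]]

# Hypothetical helper: product and stock are character vectors of length 3
# holding the three rows of one detected sequence.
compute_new <- function(product, stock) {
  last <- items(product[3])
  # keep items that are in the second row's product
  # but not in the first row's stock
  keep <- last[last %in% items(product[2]) & !(last %in% items(stock[1]))]
  paste(keep, collapse = ",")   # "" when nothing qualifies
}

compute_new(c("B", "C", "C"), c("A,B", "A,B,C", "A,B,C"))             # rows 2-4,   ID 1: "C"
compute_new(c("A", "B", "A"), c("A", "A,B", "A,B"))                   # rows 8-10,  ID 2: ""
compute_new(c("A,B,C", "D", "D"), c("A,B,C", "A,B,C,D", "A,B,C,D"))   # rows 12-14, ID 3: "D"
```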

The desired result looks like this:

> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3    
3  3    2    3    4   D

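Notes:

  1. In the sequence detected for an ID, no new product is added to the stock.
  2. In the original data, there are many IDs without any such sequence.
  3. Some IDs have more than one qualifying sequence. All of them should be recorded.
  4. count is always 1 or greater.
  5. The original data contains millions of IDs, with up to 1500 seqs each.

How would you write efficient code to get this output?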

2 answers:

Answer 0 (score: 1)

Here is a data.table option:

library(data.table)

char_cols <- c("product", "stock")
setDT(df1)[, 
           (char_cols) := lapply(.SD, as.character), 
           .SDcols = char_cols] # in case they're factors
df1[, c1 := (count == 1) & 
            (shift(count) > 1) & 
            (shift(count, 2L) == 1), 
     by = ID] #condition1
df1[, pat := paste0("(", gsub(",", "|", product), ")")] # pattern
df1[, c2 := mapply(grepl, pat, shift(product)) & 
            !mapply(grepl, pat, shift(stock, 2L)), 
    by = ID] # condition2
df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
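
To see why condition1 is evaluated at the last row of each window, here is a small sketch on toy data (the table `dt` and the columns `prev1` and `prev2` are made up for illustration). `shift()` looks backwards within each ID, so the first rows of a group get NA, and data.table treats NA in a logical `i` as FALSE when subsetting.

```r
library(data.table)

# Toy data, not the original df1
dt <- data.table(ID = c(1, 1, 1, 2, 2), count = c(1, 3, 1, 1, 2))

dt[, prev1 := shift(count), by = ID]       # count one row earlier, per ID
dt[, prev2 := shift(count, 2L), by = ID]   # count two rows earlier, per ID

# TRUE only on the last row of a (== 1, > 1, == 1) window
dt[, c1 := (count == 1) & (prev1 > 1) & (prev2 == 1)]
dt$c1
# [1]    NA FALSE  TRUE    NA FALSE

dt[(c1)]   # NA rows are dropped, only row 3 remains
```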

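Answer 1 (score: 1)

Here is another approach using tidyverse; however, I think lag and lead make this solution somewhat time-consuming. I included comments in the code to make it clearer.

Still, I had spent enough time on it that I decided to post it.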

library(tidyverse)

df1 %>% group_by(ID) %>%  

 # flag each row with count > 1 whose previous and next rows
 # (within the same ID) both have count == 1
 mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>% 

 # mark all three rows of each qualifying chunk as TRUE
 # so that we can filter on them later
 mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F))  %>% 

 filter(test2) %>% ungroup() %>% 

 # group each three occurrences in case of having multiple ones within each ID
 group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>% 

 # creating new column with string extracting techniques ...
 #... (assuming those columns are characters) 
 mutate(new=
 str_remove_all(
    as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
               stock[1])) %>% 

  # selecting desired columns and adding times for long to wide conversion
  select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>% 

  # long to wide conversion using tidyr (part of tidyverse)
  gather(key, value, -ID, -G, -new, -times) %>%
  unite(col, key, times) %>% spread(col, value) %>% 

  # making the desired order of columns
  select(-G,-new,new) %>% as.data.frame()

#   ID seqs_1 seqs_2 seqs_3 new
# 1  1      2      3      4   C
# 2  2      1      2      3    
# 3  3      2      3      4   D
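
The grouping expression `trunc(3:(n()+2)/3)` in the code above can look cryptic. A quick sketch of what it evaluates to for six filtered rows (here `n` stands in for `n()`, and `g` is just an illustrative name):

```r
n <- 6   # stand-in for n(), the number of rows left after filter(test2)

# `:` binds tighter than `/`, so this is trunc((3:8) / 3)
g <- trunc(3:(n + 2) / 3)
g
# [1] 1 1 1 2 2 2
```

So every three consecutive rows get the same group number. As far as I can tell, this relies on the filtered rows forming complete, non-overlapping triples; overlapping windows such as counts 1, 2, 1, 2, 1 would leave five rows and break that grouping.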