R: delete preceding rows based on the value in a separate column

Time: 2020-10-19 10:16:05

Tags: r dplyr data.table

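I am new to R. I am trying to delete preceding rows based on a condition set by another column.

I have found dplyr and data.table solutions that I think are close to what I am looking for, except that they do the opposite.

Sample data: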

Cust_ID | Date                 | Value
500219  | 2016-04-11 12:00:00  | 0
500219  | 2016-04-12 16:00:00  | A
500219  | 2016-04-14 11:00:00  | A
500219  | 2016-04-15 12:00:00  | B
500219  | 2016-05-23 09:00:00  | B
500219  | 2016-05-02 19:00:00  | C
500220  | 2016-04-11 12:00:00  | C
500220  | 2016-04-14 11:00:00  | C
500220  | 2016-04-15 12:00:00  | A
500220  | 2016-05-23 09:00:00  | A
500220  | 2016-05-02 19:00:00  | A

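For each Cust_ID, I want to keep only the rows from the first row where Value == "A" onward, including that row. This would result in the following data frame: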

Cust_ID | Date                 | Value
500219  | 2016-04-12 16:00:00  | A
500219  | 2016-04-14 11:00:00  | A
500219  | 2016-04-15 12:00:00  | B
500219  | 2016-05-23 09:00:00  | B
500219  | 2016-05-02 19:00:00  | C
500220  | 2016-04-15 12:00:00  | A
500220  | 2016-05-23 09:00:00  | A
500220  | 2016-05-02 19:00:00  | A

These are the solutions I have already found (from "R delete rows based on values in previous rows"):

library(data.table)
setDT(df1)[df1[,  if(any(Value == "A")) .I[seq(max(which(Value == "A")))]
                                 else .I[1:.N] , by = Cust_ID]$V1]


library(dplyr)
df1 %>% 
     group_by(Cust_ID) %>% 
     slice(if(any(Value=="A")) seq(max(which(Value=="A"))) else row_number())

3 answers:

Answer 0 (score: 1)

Perhaps a base R option with subset + ave can help

subset(df,ave(Value, Cust_ID, FUN = cumsum)>0)

which gives

   Cust_ID                Date Value
3   500219 2016-04-14 11:00:00     1
4   500219 2016-04-15 12:00:00     1
5   500219 2016-05-23 09:00:00     0
6   500219 2016-05-02 19:00:00     0
8   500220 2016-04-14 11:00:00     1
9   500220 2016-04-15 12:00:00     1
10  500220 2016-05-23 09:00:00     0
11  500220 2016-05-02 19:00:00     0
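
The output above shows Value as 0/1, so this call appears to have been run on a numeric version of the column; cumsum would not work directly on the character Value from the question. Below is a minimal sketch of the same subset + ave idea, assuming Value is character. The data frame df is rebuilt from the question's sample, and the helper name seen_a is only illustrative.

```r
# Sample data from the question; Value is a character column here.
df <- data.frame(
  Cust_ID = rep(c(500219, 500220), times = c(6, 5)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
  Value = c("0", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# Running count of "A" rows within each customer: it stays at 0 until the
# first "A" and is positive from that row onward.
seen_a <- ave(as.integer(df$Value == "A"), df$Cust_ID, FUN = cumsum)

# Keep the first "A" row of each customer and everything after it.
res_base <- df[seen_a > 0, ]
print(res_base)
```

Comparing Value to "A" first turns the column into a 0/1 flag, which is the shape the original one-liner seems to expect.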

Answer 1 (score: 1)

This works:

> library(dplyr)
> df %>% group_by(Cust_ID) %>% filter(row_number() >= min(which(Value == 'A')))
# A tibble: 8 x 3
# Groups:   Cust_ID [2]
  Cust_ID Date                Value
    <dbl> <chr>               <chr>
1  500219 2016-04-12 16:00:00 A    
2  500219 2016-04-14 11:00:00 A    
3  500219 2016-04-15 12:00:00 B    
4  500219 2016-05-23 09:00:00 B    
5  500219 2016-05-02 19:00:00 C    
6  500220 2016-04-15 12:00:00 A    
7  500220 2016-05-23 09:00:00 A    
8  500220 2016-05-02 19:00:00 A    
> 
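
One caveat: if some Cust_ID has no "A" at all, which() is empty and min() returns Inf with a warning. A hedged alternative, assuming dplyr's cumany() is acceptable here: it becomes TRUE at the first match and stays TRUE, so such groups simply drop out without a warning. The data frame below is rebuilt from the question's sample; customer 500221 and its row are hypothetical, added only to illustrate the no-"A" case.

```r
library(dplyr)

# Question's sample data plus one hypothetical customer (500221) with no "A".
df <- data.frame(
  Cust_ID = c(rep(c(500219, 500220), times = c(6, 5)), 500221),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00"),  # hypothetical row
  Value = c("0", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

res_dplyr <- df %>%
  group_by(Cust_ID) %>%
  # cumany() is FALSE before the first "A" in the group and TRUE from it onward.
  filter(cumany(Value == "A")) %>%
  ungroup()

print(res_dplyr)
```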

Answer 2 (score: 0)

Here is another option using a non-equi join:

df1[df1[Value=="A", Date[1L], Cust_ID], on=.(Cust_ID, Date>=V1)]

Output:

   Cust_ID                Date Value
1:  500219 2016-04-12 16:00:00     A
2:  500219 2016-04-12 16:00:00     A
3:  500219 2016-04-12 16:00:00     B
4:  500219 2016-04-12 16:00:00     B
5:  500219 2016-04-12 16:00:00     C
6:  500220 2016-04-15 12:00:00     A
7:  500220 2016-04-15 12:00:00     A
8:  500220 2016-04-15 12:00:00     A
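
Note that in the output above every Date equals the first "A" timestamp of its customer: in a non-equi join, data.table fills the join column with the value from the inner table rather than each row's own date. The sketch below keeps the original dates by asking for x.Date in j. It uses the same sample data, rebuilt so the snippet runs by itself. The names first_a and first_a_date are illustrative, and tz = "UTC" is an assumption made only for reproducibility.

```r
library(data.table)

# Sample data from the question.
df1 <- data.table(
  Cust_ID = rep(c(500219L, 500220L), times = c(6, 5)),
  Date = as.POSIXct(
    c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
      "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
      "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
      "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
    tz = "UTC"),
  Value = c("0", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A")
)

# Timestamp of the first "A" row (in row order) for each customer.
first_a <- df1[Value == "A", .(first_a_date = Date[1L]), by = Cust_ID]

# Non-equi join: keep rows dated on or after that timestamp.
# x.Date refers to the Date column of df1 itself, not the join boundary.
res_dt <- df1[first_a, on = .(Cust_ID, Date >= first_a_date),
              .(Cust_ID, Date = x.Date, Value)]
print(res_dt)
```

Because the filter compares dates rather than row positions, this matches the row-based answers only when no row before the first "A" carries a later date, which holds for the sample data.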

Data:

df1 <- fread("Cust_ID | Date                 | Value
500219  | 2016-04-11 12:00:00  | 0
500219  | 2016-04-12 16:00:00  | A
500219  | 2016-04-14 11:00:00  | A
500219  | 2016-04-15 12:00:00  | B
500219  | 2016-05-23 09:00:00  | B
500219  | 2016-05-02 19:00:00  | C
500220  | 2016-04-11 12:00:00  | C
500220  | 2016-04-14 11:00:00  | C
500220  | 2016-04-15 12:00:00  | A
500220  | 2016-05-23 09:00:00  | A
500220  | 2016-05-02 19:00:00  | A")
df1[, Date := as.POSIXct(Date, format="%Y-%m-%d %T")]