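I'm new to R. I'm trying to delete the preceding rows based on a condition set by another column.
I've found dplyr and data.table solutions that I think are close to what I'm looking for, except that they do the opposite.
Sample data: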
Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | A
500219 | 2016-04-14 11:00:00 | A
500219 | 2016-04-15 12:00:00 | B
500219 | 2016-05-23 09:00:00 | B
500219 | 2016-05-02 19:00:00 | C
500220 | 2016-04-11 12:00:00 | C
500220 | 2016-04-14 11:00:00 | C
500220 | 2016-04-15 12:00:00 | A
500220 | 2016-05-23 09:00:00 | A
500220 | 2016-05-02 19:00:00 | A
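For each Cust_ID, I want to keep only the rows from the first row where Value == "A" onward, including that row. This would result in the following data frame: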
Cust_ID | Date | Value
500219 | 2016-04-12 16:00:00 | A
500219 | 2016-04-14 11:00:00 | A
500219 | 2016-04-15 12:00:00 | B
500219 | 2016-05-23 09:00:00 | B
500219 | 2016-05-02 19:00:00 | C
500220 | 2016-04-15 12:00:00 | A
500220 | 2016-05-23 09:00:00 | A
500220 | 2016-05-02 19:00:00 | A
These are the solutions I've already found (R delete rows based on values in previous rows):
library(data.table)
setDT(df1)[df1[, if(any(Value == "A")) .I[seq(max(which(Value == "A")))]
else .I[1:.N] , by = Cust_ID]$V1]
library(dplyr)
df1 %>%
group_by(Cust_ID) %>%
slice(if(any(Value=="A")) seq(max(which(Value=="A"))) else row_number())
Answer 0 (score: 1)
Maybe a base R option with subset + ave can help:
subset(df,ave(Value, Cust_ID, FUN = cumsum)>0)
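which gives (note that this output comes from a version of the data in which Value is a numeric 0/1 flag, so cumsum can be applied to it directly)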
Cust_ID Date Value
3 500219 2016-04-14 11:00:00 1
4 500219 2016-04-15 12:00:00 1
5 500219 2016-05-23 09:00:00 0
6 500219 2016-05-02 19:00:00 0
8 500220 2016-04-14 11:00:00 1
9 500220 2016-04-15 12:00:00 1
10 500220 2016-05-23 09:00:00 0
11 500220 2016-05-02 19:00:00 0
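For the character Value column shown in the question, the same subset + ave idea might be adapted by taking the cumulative sum of the logical Value == "A" within each Cust_ID. The following is a minimal sketch, assuming a data frame named df built from the question's sample data; the name res is only illustrative.

```r
# Sample data from the question (Value is character here, not a 0/1 flag)
df <- data.frame(
  Cust_ID = c(rep(500219, 6), rep(500220, 5)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00"),
  Value = c("0", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A"),
  stringsAsFactors = FALSE
)

# cumsum(Value == "A") stays 0 until the first "A" within each Cust_ID,
# so keeping rows where it is > 0 keeps the first "A" and every row after it
res <- subset(df, ave(Value == "A", Cust_ID, FUN = cumsum) > 0)
print(res)
```

A Cust_ID that never has Value == "A" keeps a cumulative sum of 0 throughout, so all of its rows are dropped.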
Answer 1 (score: 1)
This works:
> library(dplyr)
> df %>% group_by(Cust_ID) %>% filter(row_number() >= min(which(Value == 'A')))
# A tibble: 8 x 3
# Groups: Cust_ID [2]
Cust_ID Date Value
<dbl> <chr> <chr>
1 500219 2016-04-12 16:00:00 A
2 500219 2016-04-14 11:00:00 A
3 500219 2016-04-15 12:00:00 B
4 500219 2016-05-23 09:00:00 B
5 500219 2016-05-02 19:00:00 C
6 500220 2016-04-15 12:00:00 A
7 500220 2016-05-23 09:00:00 A
8 500220 2016-05-02 19:00:00 A
>
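One thing to watch with min(which(Value == 'A')): for a Cust_ID that has no "A" at all, which() returns an empty vector and min() then returns Inf with a warning. A possible alternative is dplyr's cumany(), which is TRUE from the first matching row onward. This is a minimal sketch rather than the answerer's method; the extra customer 500221 and its dates are made up purely to illustrate the no-"A" case.

```r
library(dplyr)

# Question's sample data plus a hypothetical customer (500221) with no "A"
df <- data.frame(
  Cust_ID = c(rep(500219, 6), rep(500220, 5), rep(500221, 2)),
  Date = c("2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
           "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
           "2016-05-23 09:00:00", "2016-05-02 19:00:00",
           "2016-04-11 12:00:00", "2016-04-12 16:00:00"),
  Value = c("0", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# cumany() is FALSE until the first TRUE in each group, then TRUE afterwards;
# groups without any "A" are dropped entirely and no warning is raised
res <- df %>%
  group_by(Cust_ID) %>%
  filter(cumany(Value == "A")) %>%
  ungroup()
print(res)
```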
Answer 2 (score: 0)
Here is another option using a non-equi join:
df1[df1[Value=="A", Date[1L], Cust_ID], on=.(Cust_ID, Date>=V1)]
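Output (note that in a non-equi join the Date column shows the join threshold, i.e. each customer's first "A" date, rather than each row's own date):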
Cust_ID Date Value
1: 500219 2016-04-12 16:00:00 A
2: 500219 2016-04-12 16:00:00 A
3: 500219 2016-04-12 16:00:00 B
4: 500219 2016-04-12 16:00:00 B
5: 500219 2016-04-12 16:00:00 C
6: 500220 2016-04-15 12:00:00 A
7: 500220 2016-04-15 12:00:00 A
8: 500220 2016-04-15 12:00:00 A
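To keep each row's own Date in the result, one option is to select the columns explicitly in j and refer to the left table's column with the x. prefix. This is a sketch under the same assumptions as the answer (the rows that follow the first "A" also have a later Date); the names thr, first_A and res are illustrative and not from the original answer.

```r
library(data.table)

# Question's sample data, with Date parsed as POSIXct (UTC assumed here)
df1 <- data.table(
  Cust_ID = c(rep(500219L, 6), rep(500220L, 5)),
  Date = as.POSIXct(c(
    "2016-04-11 12:00:00", "2016-04-12 16:00:00", "2016-04-14 11:00:00",
    "2016-04-15 12:00:00", "2016-05-23 09:00:00", "2016-05-02 19:00:00",
    "2016-04-11 12:00:00", "2016-04-14 11:00:00", "2016-04-15 12:00:00",
    "2016-05-23 09:00:00", "2016-05-02 19:00:00"), tz = "UTC"),
  Value = c("0", "A", "A", "B", "B", "C", "C", "C", "A", "A", "A")
)

# Date of the first "A" row for each customer
thr <- df1[Value == "A", .(first_A = Date[1L]), by = Cust_ID]

# Non-equi join: keep rows dated on or after the threshold;
# x.Date pulls the original Date from df1 instead of the threshold value
res <- df1[thr, on = .(Cust_ID, Date >= first_A),
           .(Cust_ID, Date = x.Date, Value)]
print(res)
```

Because the comparison is on Date rather than on row position, this approach depends on the dates, not the row order, marking what comes "after" the first "A".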
Data:
df1 <- fread("Cust_ID | Date | Value
500219 | 2016-04-11 12:00:00 | 0
500219 | 2016-04-12 16:00:00 | A
500219 | 2016-04-14 11:00:00 | A
500219 | 2016-04-15 12:00:00 | B
500219 | 2016-05-23 09:00:00 | B
500219 | 2016-05-02 19:00:00 | C
500220 | 2016-04-11 12:00:00 | C
500220 | 2016-04-14 11:00:00 | C
500220 | 2016-04-15 12:00:00 | A
500220 | 2016-05-23 09:00:00 | A
500220 | 2016-05-02 19:00:00 | A")
df1[, Date := as.POSIXct(Date, format="%Y-%m-%d %T")]