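I am not sure the title really reflects what I am trying to do. Ultimately, I want to select rows that follow a certain pattern in the ActionType column, by group. The grouping variable is email. For each email, if the first row of ActionType is won, I want to delete it and look at the second row. If the second row of ActionType is also won, I want to delete it as well and move on to the next row, and so on.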
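So condition 1 is that the first row kept for each email must not be a won.
Next, once that is satisfied, I want to select everything from that first row (which is not a won) up to the next won.
The process then repeats until all rows in the group have been checked. I don't care about rows that occur after a won unless they come before another won. Also, if two wons are back to back, I want to select the rows up to the first won (including that won), delete the won that follows it, and then continue checking rows, keeping those that come before another won.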
I tried using cumsum with dplyr and data.table, but I may need to do this in several steps.
This is what my data looks like:
email Action ActionType Date
wwww Company won 1/17/14
wwww Company trial 1/22/14
wwww Event Meeting 1/24/14
wwww Event Meeting 2/24/14
wwww Gmail Email 9/10/14
wwww Company won 9/11/14
wwww Company won 9/25/14
wwww Event Support 10/7/14
wwww Company won 10/22/14
wwww Company won 12/31/14
wwww Gmail Email 2/13/15
wwww Gmail Email 2/27/15
wwww Gmail Email 3/6/15
wwww Gmail Email 3/26/15
wwww Gmail Email 4/20/15
wwww Gmail Email 4/24/15
wwww Gmail Email 5/13/15
xxxx Company trial 1/17/14
xxxx Gmail Email 1/22/14
xxxx Event Meeting 1/24/14
xxxx Company won 2/24/14
xxxx Gmail Email 9/10/14
xxxx Gmail Email 9/11/14
xxxx Gmail Email 9/25/14
xxxx Gmail Email 10/7/14
xxxx Gmail Email 10/22/14
yyyy Company won 1/24/14
yyyy Company trial 2/24/14
yyyy Task Call 9/10/14
yyyy Task Call 9/11/14
yyyy Task Call 9/25/14
yyyy Company won 10/7/14
yyyy Gmail Email 10/22/14
yyyy Gmail Email 12/31/14
zzzz Company won 9/11/14
zzzz Company won 9/25/14
zzzz Task Call 10/7/14
zzzz Task Call 10/22/14
zzzz Company trial 12/31/14
zzzz Gmail Email 2/13/15
zzzz Company won 2/27/15
zzzz Gmail Email 3/6/15
zzzz Gmail Email 3/26/15
I would like the end result to look like this:
email Action ActionType Date
wwww Company trial 1/22/14
wwww Event Meeting 1/24/14
wwww Event Meeting 2/24/14
wwww Gmail Email 9/10/14
wwww Company won 9/11/14
wwww Event Support 10/7/14
wwww Company won 10/22/14
xxxx Company trial 1/17/14
xxxx Gmail Email 1/22/14
xxxx Event Meeting 1/24/14
xxxx Company won 2/24/14
yyyy Company trial 2/24/14
yyyy Task Call 9/10/14
yyyy Task Call 9/11/14
yyyy Task Call 9/25/14
yyyy Company won 10/7/14
zzzz Task Call 10/7/14
zzzz Task Call 10/22/14
zzzz Company trial 12/31/14
zzzz Gmail Email 2/13/15
zzzz Company won 2/27/15
Answer 0 (score: 2)
Here is one way:
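library(data.table)
# DT is assumed to be the question's data, already converted to a data.table

# cut off leading wins and trailing nonwins
goodi = DT[, .I[
    rev(cumsum(rev(ActionType=="won"))) > 0L &   # a win occurs at or after this row
    cumsum(ActionType!="won") > 0L               # a nonwin occurs at or before this row
], by=email]$V1

# take the first win when there's a succession of 'em
DT[goodi, r := rleid(ActionType=="won"), by=email]
badi = DT[!is.na(r), .I[ ActionType=="won" & 1:.N > 1], by=.(email,r)]$V1

# drop the helper column, then keep the good rows minus the repeated wins
DT[, r := NULL]
DT[setdiff(goodi,badi)]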
which gives the desired output:
email Action ActionType Date
1: wwww Company trial 1/22/14
2: wwww Event Meeting 1/24/14
3: wwww Event Meeting 2/24/14
4: wwww Gmail Email 9/10/14
5: wwww Company won 9/11/14
6: wwww Event Support 10/7/14
7: wwww Company won 10/22/14
8: xxxx Company trial 1/17/14
9: xxxx Gmail Email 1/22/14
10: xxxx Event Meeting 1/24/14
11: xxxx Company won 2/24/14
12: yyyy Company trial 2/24/14
13: yyyy Task Call 9/10/14
14: yyyy Task Call 9/11/14
15: yyyy Task Call 9/25/14
16: yyyy Company won 10/7/14
17: zzzz Task Call 10/7/14
18: zzzz Task Call 10/22/14
19: zzzz Company trial 12/31/14
20: zzzz Gmail Email 2/13/15
21: zzzz Company won 2/27/15
email Action ActionType Date
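As a cross-check, the same rule can probably be written as a single per-row test rather than two index sets. The sketch below is not the answer's code. It assumes the rule reduces to "keep a nonwin if a win comes later in the group; keep a win only if the row just before it in the group is a nonwin". The small table and the keep column are made up for illustration.

```r
library(data.table)

# Hypothetical toy data with the same grouping and ActionType columns
DT <- data.table(
  email      = c("a", "a", "a", "a", "a", "a", "b", "b", "b"),
  ActionType = c("won", "trial", "won", "won", "Email", "Email",
                 "Call", "won", "Email")
)

# Flag the rows to keep, group by group
DT[, keep := {
  won       <- ActionType == "won"
  prev_won  <- shift(won, fill = TRUE)       # treat "before the first row" as a win
  won_later <- rev(cumsum(rev(won))) > 0L    # a win at this row or further down
  # wins: keep only when the previous row is a nonwin
  # nonwins: keep only when a win still follows
  ifelse(won, !prev_won, won_later)
}, by = email]

result <- DT[keep == TRUE, .(email, ActionType)]
DT[, keep := NULL]   # remove the helper column again
print(result)
```

On this toy input, result holds the trial and the first following won for email a, and the Call and won rows for email b. Leading wins, repeated wins, and trailing nonwins are dropped. Assigning the flag with := keeps it aligned with the original rows even if the groups are not contiguous.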