I am trying to filter the products that each customer bought after buying product "A".
My sample dataset:
fk_ConsumerID ProductName Date
1 B 2015.10.12
1 A 2015.10.14
1 C 2015.10.18
1 D 2015.10.19
2 A 2015.10.10
2 B 2015.10.12
2 C 2015.10.14
2 D 2015.10.18
2 E 2015.10.19
3 C 2015.10.14
3 D 2015.10.18
3 A 2015.10.19
4 B 2015.10.10
The result I want:
fk_ConsumerID ProductName Date
1 C 2015.10.18
1 D 2015.10.19
2 B 2015.10.12
2 C 2015.10.14
2 D 2015.10.18
2 E 2015.10.19
The code I tried:
library(dplyr)
#Grouping customers
customers <- group_by(df, fk_ConsumerID)
#Filtering the ones that appear after A (doesn't work)
f<-filter(customers, ProductName > "A")
Answer 0 (score: 0)
I will try to find a more concise solution, but here is a temporary one.
library(dplyr)
library(purrr)
df <- data.frame(fk_ConsumerID=c(1,1,1,1,2,2,2,2,2,3,3,3,4),
ProductName=c("B","A","C","D","A","B","C","D","E","C","D","A","B"),
Date=c(1:13)
)
df <- df %>% group_by(fk_ConsumerID) %>%
mutate(cc=ProductName=="A",
ss=seq_along(ProductName)
)
fk_ConsumerID ProductName Date cc ss
<dbl> <fctr> <int> <lgl> <int>
1 1 B 1 FALSE 1
2 1 A 2 TRUE 2
3 1 C 3 FALSE 3
4 1 D 4 FALSE 4
5 2 A 5 TRUE 1
6 2 B 6 FALSE 2
7 2 C 7 FALSE 3
8 2 D 8 FALSE 4
9 2 E 9 FALSE 5
10 3 C 10 FALSE 1
11 3 D 11 FALSE 2
12 3 A 12 TRUE 3
13 4 B 13 FALSE 1
A temporary data frame that lists, for each fk_ConsumerID, the index of the entry with A:
kk <- df[which(df$cc==TRUE),c(1,5)]
names(kk)[2] <- "idx"
> kk
Source: local data frame [3 x 2]
Groups: fk_ConsumerID [3]
fk_ConsumerID idx
<dbl> <int>
1 1 2
2 2 1
3 3 3
A helper that looks up the index of the entry with A, so it can be added as a new column:
getIndex <- function(x){
kk$idx[kk$fk_ConsumerID==x] %>%
as.integer
}
Filter based on the index value:
df <- df %>%
mutate(idx=map(fk_ConsumerID,getIndex )) %>%
filter(ss>idx) %>%
select(1:3)
Source: local data frame [6 x 3]
Groups: fk_ConsumerID [2]
fk_ConsumerID ProductName Date
<dbl> <fctr> <int>
1 1 C 3
2 1 D 4
3 2 B 6
4 2 C 7
5 2 D 8
6 2 E 9
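Answer 1 (score: 0)
First create a temporary rank variable, then keep only the groups that contain ProductName == 'A', and finally keep the rows whose rank is greater than the rank of the ProductName == 'A' row.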
df%>%group_by(fk_ConsumerID)%>%mutate(rank=1:n())%>%
filter(sum(ProductName=='A')>0)%>%filter(rank>rank[ProductName=='A'])%>%
select(-rank)
# fk_ConsumerID ProductName Date
<int> <chr> <chr>
1 1 C 2015.10.18
2 1 D 2015.10.19
3 2 B 2015.10.12
4 2 C 2015.10.14
5 2 D 2015.10.18
6 2 E 2015.10.19
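The same idea can be sketched without a temporary rank column. This is a minimal sketch, assuming each customer's rows are already sorted by date and that only the first purchase of A matters; the sample data frame is rebuilt from the question, and the name `res` is only illustrative. `match("A", ProductName)` gives the position of the first A within the group, or `NA` when the customer never bought A, and `filter()` drops rows where the condition is `NA`.

```r
library(dplyr)

# Sample data from the question (dates kept as strings)
df <- data.frame(
  fk_ConsumerID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4),
  ProductName   = c("B", "A", "C", "D", "A", "B", "C", "D", "E", "C", "D", "A", "B"),
  Date          = c("2015.10.12", "2015.10.14", "2015.10.18", "2015.10.19",
                    "2015.10.10", "2015.10.12", "2015.10.14", "2015.10.18",
                    "2015.10.19", "2015.10.14", "2015.10.18", "2015.10.19",
                    "2015.10.10"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(fk_ConsumerID) %>%
  # keep rows positioned after the first "A" in each group;
  # groups without "A" yield NA and are dropped by filter()
  filter(row_number() > match("A", ProductName)) %>%
  ungroup()
```

Customers who never bought A, or whose last purchase was A, contribute no rows.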
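Answer 2 (score: -1)
Here is a dplyr solution to your problem. First, we find the time at which each customer bought product A. This time is stored in a new column named timeA. Then we simply select all rows whose date is later than that time.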
df %>%
group_by(fk_ConsumerID) %>%
filter(ProductName=="A") %>%
summarise(timeA = min(Date)) %>%
right_join(df) %>%
filter(!is.na(timeA),Date > timeA)
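The approach above compares dates as they are stored and relies on `right_join()` guessing the join column. A minimal sketch of the same idea, assuming the dates are parsed with `as.Date()` and the join key is stated explicitly, could look like this; the names `first_a` and `res` are illustrative and not part of the original answer, and an `inner_join()` is swapped in so that customers without A drop out without an `is.na()` check.

```r
library(dplyr)

# Sample data from the question, with Date parsed into a real Date column
df <- data.frame(
  fk_ConsumerID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4),
  ProductName   = c("B", "A", "C", "D", "A", "B", "C", "D", "E", "C", "D", "A", "B"),
  Date          = as.Date(c("2015.10.12", "2015.10.14", "2015.10.18", "2015.10.19",
                            "2015.10.10", "2015.10.12", "2015.10.14", "2015.10.18",
                            "2015.10.19", "2015.10.14", "2015.10.18", "2015.10.19",
                            "2015.10.10"), format = "%Y.%m.%d"),
  stringsAsFactors = FALSE
)

# Date of the first purchase of A per customer
first_a <- df %>%
  filter(ProductName == "A") %>%
  group_by(fk_ConsumerID) %>%
  summarise(timeA = min(Date))

# Keep only purchases made strictly after that date
res <- df %>%
  inner_join(first_a, by = "fk_ConsumerID") %>%
  filter(Date > timeA) %>%
  select(-timeA)
```

Because the comparison is strict, a product bought on the same day as A is not kept.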
Answer 3 (score: -1)
The following data.table (version 1.9.7) solution uses a non-equi join:
library(data.table)
# date of first purchase of product A by each customer
# (thereby removing edge case where purchase of A was the last purchase)
fp <- dt[ProductName == "A" & Date < max(Date), .(minDate = min(Date)), by = fk_ConsumerID]
# non-equi join
dt[fp, on = c("fk_ConsumerID", "Date>minDate")]
# fk_ConsumerID ProductName Date
#1: 1 C 2015-10-14
#2: 1 D 2015-10-14
#3: 2 B 2015-10-10
#4: 2 C 2015-10-10
#5: 2 D 2015-10-10
#6: 2 E 2015-10-10
To make it reproducible:
dt <- fread("fk_ConsumerID ProductName Date
1 B 2015.10.12
1 A 2015.10.14
1 C 2015.10.18
1 D 2015.10.19
2 A 2015.10.10
2 B 2015.10.12
2 C 2015.10.14
2 D 2015.10.18
2 E 2015.10.19
3 C 2015.10.14
3 D 2015.10.18
3 A 2015.10.19
4 B 2015.10.10")
dt[, Date := anytime::anydate(Date)]
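In the output shown above, the Date column carries the joined minDate value rather than each purchase's own date. A minimal sketch of one way to keep the original dates, assuming a data.table version that supports the `x.` prefix in `j` (1.9.8 or later, to the best of my knowledge), is shown below. The table is built directly instead of with `fread()`, and `nomatch = 0L` is used so that customers whose last purchase was A return no rows; `res` is an illustrative name.

```r
library(data.table)

# Sample data from the question, with Date parsed into a real Date column
dt <- data.table(
  fk_ConsumerID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4),
  ProductName   = c("B", "A", "C", "D", "A", "B", "C", "D", "E", "C", "D", "A", "B"),
  Date          = as.Date(c("2015.10.12", "2015.10.14", "2015.10.18", "2015.10.19",
                            "2015.10.10", "2015.10.12", "2015.10.14", "2015.10.18",
                            "2015.10.19", "2015.10.14", "2015.10.18", "2015.10.19",
                            "2015.10.10"), format = "%Y.%m.%d")
)

# Date of the first purchase of A per customer
fp <- dt[ProductName == "A", .(minDate = min(Date)), by = fk_ConsumerID]

# Non-equi join: rows of dt dated after minDate for the same customer.
# x.Date refers to the Date column of dt itself, not the join value.
res <- dt[fp, on = .(fk_ConsumerID, Date > minDate),
          .(fk_ConsumerID, ProductName, Date = x.Date),
          nomatch = 0L]
```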