I have the following dataset:
USERNAME API_TRACK_EVENT TIME
userA Viewed pic 1454941960
userA Order/payment 1454941972
userA Edit pic 1454941973
userA Order/Changed Address 1454941976
userB Viewed pic 1454941983
userB Order/guestlogin 1454941986
userB Order/Changed Address 1454941992
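I want to perform the following operations on the dataset:
Sort the rows by TIME for each user
Remove all rows after the first occurrence of 'Order' in API_TRACK_EVENT
(The goal is to get all of a user's API track events up to and including their first order.) So, how should I do this? [Also open to using dplyr.]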
Answer 0 (score: 1)
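After we arrange() by 'USERNAME' and 'TIME' and group by 'USERNAME', we get the index of the first occurrence of 'Order' with grepl() and which.max(). Add 1 to that index and build the sequence (:) from there to the number of rows in the group (n()). Since these are the rows we need to remove from the dataset, we can use setdiff() to find the row indices that are not in the created sequence and slice() them.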
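The index arithmetic described above can be traced by hand on userA's events. This is only a base R sketch of the idea, outside dplyr; the variable names (events, is_order, rows_to_drop, rows_to_keep) are made up for illustration and the events are assumed to be already sorted by TIME.

```r
# userA's events from the question, already in TIME order
events <- c("Viewed pic", "Order/payment", "Edit pic", "Order/Changed Address")

# TRUE wherever the event text contains "Order"
is_order <- grepl("Order", events)        # FALSE TRUE FALSE TRUE

# which.max() on a logical vector returns the position of the first TRUE
first_order <- which.max(is_order)        # 2

# rows after the first order are the ones to drop
rows_to_drop <- (first_order + 1):length(events)   # 3 4

# setdiff() leaves the row indices that are not in the drop sequence
rows_to_keep <- setdiff(seq_along(events), rows_to_drop)   # 1 2

events[rows_to_keep]                      # "Viewed pic" "Order/payment"
```

Inside slice() the same steps run once per group, with row_number() playing the role of seq_along(events) and n() the role of length(events).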
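library(dplyr)

df1 %>%
  arrange(USERNAME, TIME) %>%   # sort each user's events chronologically
  group_by(USERNAME) %>%
  # keep only the rows that are NOT in (first 'Order' row + 1):n()
  slice(setdiff(row_number(),
                (which.max(grepl("Order", API_TRACK_EVENT)) + 1):n()))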
# USERNAME API_TRACK_EVENT TIME
# <chr> <chr> <int>
#1 userA Viewed pic 1454941960
#2 userA Order/payment 1454941972
#3 userB Viewed pic 1454941983
#4 userB Order/guestlogin 1454941986
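One thing to watch with the slice() version: it relies on every user having an 'Order' event that is not in their last row. If a user has no order at all, which.max() returns 1 and only the first row survives; if the first order is the last row, (n() + 1):n() counts downwards and the order row itself is dropped. The sketch below is one possible variant that avoids both cases by using match() with nomatch = n(). It is not part of the original answer, and the users userC and userD are invented here purely to exercise those edge cases.

```r
library(dplyr)

# The question's data, rebuilt so the example is self-contained
df1 <- data.frame(
  USERNAME = c("userA", "userA", "userA", "userA", "userB", "userB", "userB"),
  API_TRACK_EVENT = c("Viewed pic", "Order/payment", "Edit pic",
                      "Order/Changed Address", "Viewed pic",
                      "Order/guestlogin", "Order/Changed Address"),
  TIME = c(1454941960, 1454941972, 1454941973, 1454941976,
           1454941983, 1454941986, 1454941992),
  stringsAsFactors = FALSE
)

# Hypothetical extra users (not in the question):
# userC never orders, userD orders in the very last row
extra <- data.frame(
  USERNAME = c("userC", "userC", "userD", "userD"),
  API_TRACK_EVENT = c("Viewed pic", "Edit pic", "Viewed pic", "Order/payment"),
  TIME = c(1454942000, 1454942005, 1454942010, 1454942015),
  stringsAsFactors = FALSE
)

result <- bind_rows(df1, extra) %>%
  arrange(USERNAME, TIME) %>%
  group_by(USERNAME) %>%
  # match() gives the position of the first TRUE, or n() when there is none,
  # so seq_len() always yields rows 1..(first order) or the whole group
  slice(seq_len(match(TRUE, grepl("Order", API_TRACK_EVENT), nomatch = n()))) %>%
  ungroup()

result
```

For userA and userB this should give the same four rows as above; userC keeps both rows and userD keeps the view plus the order.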
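Another option is to use filter():
df1 %>%
  arrange(USERNAME, TIME) %>%
  group_by(USERNAME) %>%
  # cumsum() counts the orders seen so far; lag() shifts that count down one
  # row, so the first order row still sees 0 and is kept
  filter(!lag(cumsum(grepl("Order", API_TRACK_EVENT)), default = 0))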
# USERNAME API_TRACK_EVENT TIME
# <chr> <chr> <int>
#1 userA Viewed pic 1454941960
#2 userA Order/payment 1454941972
#3 userB Viewed pic 1454941983
#4 userB Order/guestlogin 1454941986
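To see why the filter() condition works, it can help to spell out the intermediate values as columns. The following is a sketch only; the column names is_order, orders_so_far, orders_before and keep are illustrative and do not appear in the original answer.

```r
library(dplyr)

# The question's data, rebuilt so the example is self-contained
df1 <- data.frame(
  USERNAME = c("userA", "userA", "userA", "userA", "userB", "userB", "userB"),
  API_TRACK_EVENT = c("Viewed pic", "Order/payment", "Edit pic",
                      "Order/Changed Address", "Viewed pic",
                      "Order/guestlogin", "Order/Changed Address"),
  TIME = c(1454941960, 1454941972, 1454941973, 1454941976,
           1454941983, 1454941986, 1454941992),
  stringsAsFactors = FALSE
)

steps <- df1 %>%
  arrange(USERNAME, TIME) %>%
  group_by(USERNAME) %>%
  mutate(
    is_order      = grepl("Order", API_TRACK_EVENT),   # is this row an order event?
    orders_so_far = cumsum(is_order),                  # orders up to and including this row
    orders_before = lag(orders_so_far, default = 0L),  # orders strictly before this row
    keep          = orders_before == 0                 # same rows as !lag(cumsum(...))
  ) %>%
  ungroup()

steps
# keep is TRUE, TRUE, FALSE, FALSE for userA and TRUE, TRUE, FALSE for userB
```

Because the count is shifted by one row, the first order row is still kept, and a user with no orders at all would keep every row.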