This is my table:
user_id event timestamp
Rob business 111111
Rob progress 111112
Rob business 222222
Mike progress 111111
Mike progress 222222
Rob progress 000001
Mike business 333333
Mike progress 444444
Lee progress 111111
Lee progress 222222
Mike business 333334
The input table:
dput(input)
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 1L, 2L),
.Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L),
.Label = c("business", "progress"), class = "factor"),
timestamp = c(111111,111112, 222222, 111111, 222222, 1, 333333, 444444, 111111, 222222, 333334)),
.Names = c("user_id", "event", "timestamp"), row.names = c(NA, -11L), class = "data.frame")
For each user_id, I want to find the last progress event before every business event. Expected output:
user_id event timestamp
Mike progress 222222
Mike progress 222222
Rob progress 111112
Rob progress 1
Any help is appreciated!
Answer 0 (score: 1)
Assuming I have understood the question correctly, this looks solvable with the lag function from dplyr.
Here is an example:
# Set up the data structure
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L,
2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business",
"progress"), class = "factor"), timestamp = c(111111,111112, 222222,
111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id",
"event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")
library(dplyr)  # provides %>%, arrange(), group_by(), mutate(), lag(), filter(), select(), rename()

# Perform the manipulation
df %>%
  arrange(user_id, timestamp) %>%               # Sort by user and timestamp
  group_by(user_id) %>%                         # Group/partition by each user
  mutate(last_event = lag(event, 1),            # Find the previous event within each user
         last_timestamp = lag(timestamp, 1)) %>% # And the time it occurred
  filter(event == "business") %>%               # Keep only the business events, as those are what we're interested in
  select(user_id, last_event, last_timestamp) %>% # Select the fields of interest
  rename(event = last_event,                    # Tidy up the field names
         timestamp = last_timestamp)
  user_id event    timestamp
  <fctr>  <fctr>       <dbl>
1 Mike    progress    222222
2 Rob     progress         1
3 Rob     progress    111112
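However, this approach fails if the event immediately before a business event is not a progress event. A simple fix is to filter down to just the business and progress events first: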
df %>%
  filter(event %in% c("business", "progress")) %>%
  arrange(user_id, timestamp) %>%
  group_by(user_id) %>%
  mutate(last_event = lag(event, 1),
         last_timestamp = lag(timestamp, 1)) %>%
  filter(event == "business") %>%
  select(user_id, last_event, last_timestamp) %>%
  rename(event = last_event,
         timestamp = last_timestamp)
On this dataset the output is the same, but this step may be necessary if other event types creep in.
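One case the lag approach does not cover is two business events in a row for the same user, which happens in the question's full 11-row table (Mike at 333333 and 333334): the second business event would pick up the first one rather than a progress event. Below is a minimal sketch of one way around that, assuming a running maximum of progress timestamps is acceptable. The names df_full, last_progress and result are made up for illustration, and the data is retyped from the question's table as plain character columns.

```r
library(dplyr)

# The question's full table (11 rows), retyped for this sketch
df_full <- data.frame(
  user_id = c("Rob", "Rob", "Rob", "Mike", "Mike", "Rob",
              "Mike", "Mike", "Lee", "Lee", "Mike"),
  event = c("business", "progress", "business", "progress", "progress", "progress",
            "business", "progress", "progress", "progress", "business"),
  timestamp = c(111111, 111112, 222222, 111111, 222222, 1,
                333333, 444444, 111111, 222222, 333334),
  stringsAsFactors = FALSE
)

result <- df_full %>%
  filter(event %in% c("business", "progress")) %>%
  arrange(user_id, timestamp) %>%
  group_by(user_id) %>%
  # Carry forward the latest progress timestamp seen so far for this user;
  # business rows contribute -Inf, so they never change the running maximum
  mutate(last_progress = cummax(ifelse(event == "progress", timestamp, -Inf))) %>%
  # Keep business rows that have at least one earlier progress event
  filter(event == "business", is.finite(last_progress)) %>%
  ungroup() %>%
  mutate(event = "progress", timestamp = last_progress) %>%
  select(user_id, event, timestamp)

result
```

On the full table this should return four rows (Mike twice with 222222, then Rob with 1 and 111112), which are the rows the question asks for, though in a different order for Rob.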
Answer 1 (score: 0)
df <-
structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L,
2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"),
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business",
"progress"), class = "factor"), timestamp = c(111111,111112, 222222,
111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id",
"event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")
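# I want to know the last progress event before every business event happens.
# Note: each row is compared only with the row directly above it, in the data
# frame's existing row order; nothing is sorted by timestamp or grouped by user_id.
new <- df[0, ]
for (i in 2:nrow(df)) {
  if (df$event[i] == "business" && df$event[i - 1] == "progress") {
    new <- rbind(new, df[i - 1, ])
  }
}
new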
  user_id    event timestamp
2     Rob progress    111112
6     Rob progress         1
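Note that the result has only 2 rows, because business occurs only three times and its first occurrence is in row 1, which has no preceding row.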
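The loop above looks at adjacent rows in the original order, so row 6 (Rob's progress event) ends up matched to the business event in row 7, which belongs to Mike. If the lookup should respect user_id and timestamp, a base R sketch along the following lines might work. The helper last_progress_before is a hypothetical name, the data is retyped from the 10-row df used in this answer, and "before" is assumed to mean a strictly smaller timestamp.

```r
# Same 10 rows as the df above, retyped as plain character columns
df <- data.frame(
  user_id = c("Rob", "Rob", "Rob", "Mike", "Mike", "Rob",
              "Mike", "Mike", "Lee", "Lee"),
  event = c("business", "progress", "business", "progress", "progress",
            "progress", "business", "progress", "progress", "progress"),
  timestamp = c(111111, 111112, 222222, 111111, 222222, 1,
                333333, 444444, 111111, 222222),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one user's rows, return the latest progress
# timestamp strictly before each business timestamp
last_progress_before <- function(d) {
  biz  <- d$timestamp[d$event == "business"]
  prog <- d$timestamp[d$event == "progress"]
  ts <- vapply(biz, function(b) {
    earlier <- prog[prog < b]
    if (length(earlier) == 0) NA_real_ else max(earlier)
  }, numeric(1))
  ts <- ts[!is.na(ts)]  # drop business events with no earlier progress event
  data.frame(user_id = rep(as.character(d$user_id[1]), length(ts)),
             event = rep("progress", length(ts)),
             timestamp = ts,
             stringsAsFactors = FALSE)
}

# Apply the helper per user and stack the pieces
res <- do.call(rbind, lapply(split(df, df$user_id), last_progress_before))
rownames(res) <- NULL
res
```

On this data the sketch should give three rows: Mike with 222222, then Rob with 1 and 111112.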