How do I find the last log entry before each event? (R)

Date: 2016-08-11 11:02:49

Tags: r

This is my table:

user_id    event       timestamp
Rob        business    111111
Rob        progress    111112
Rob        business    222222
Mike       progress    111111
Mike       progress    222222
Rob        progress    000001
Mike       business    333333
Mike       progress    444444
Lee        progress    111111
Lee        progress    222222
Mike       business    333334

Input table:

    dput(input)
    df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 1L, 2L),
 .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
 event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L),
 .Label = c("business", "progress"), class = "factor"), 
timestamp = c(111111,111112, 222222, 111111, 222222, 1, 333333, 444444, 111111, 222222, 333334)), 
.Names = c("user_id", "event", "timestamp"), row.names = c(NA, -11L), class = "data.frame")

For every business event, I want to find the last progress event that happened before it, for each user_id:

    user_id    event       timestamp
    Mike       progress    222222
    Mike       progress    222222
    Rob        progress    111112
    Rob        progress         1

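Any help is appreciated!

2 answers:

Answer 0 (score: 1)

Provided I understand the question correctly, it looks like this can be solved with the lag function from dplyr.

Here is an example: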

library(dplyr) # Provides %>%, arrange, group_by, mutate, lag, filter, select and rename

# Set up the data structure
df <- structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 
    2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
    event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business", 
    "progress"), class = "factor"), timestamp = c(111111,111112, 222222, 
    111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id", 
    "event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")

# Perform the manipulation
df %>% 
    arrange(user_id, timestamp) %>% # Sort by user and timestamp
    group_by(user_id) %>% # Group/partition by each user
    mutate(last_event = lag(event, 1), # Find the last event
           last_timestamp = lag(timestamp, 1)) %>% # And the time it occurred
    filter(event == "business") %>% # Chop down to just the business events - as that's what we're interested in
    select(user_id, last_event, last_timestamp) %>% # Select the fields of interest
    rename(event = last_event, # Tidy up the field names
           timestamp = last_timestamp)
  user_id    event timestamp
   <fctr>   <fctr>     <dbl>
1    Mike progress    222222
2     Rob progress         1
3     Rob progress    111112

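However, this will not work if the event immediately preceding a business event is not a progress event. A simple fix is to filter down to just the business and progress events first: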

df %>% 
    filter(event == "business"|event == "progress") %>% 
    arrange(user_id, timestamp) %>% 
    group_by(user_id) %>% 
    mutate(last_event = lag(event, 1),
           last_timestamp = lag(timestamp, 1)) %>% 
    filter(event == "business") %>% 
    select(user_id, last_event, last_timestamp) %>% 
    rename(event = last_event, 
           timestamp = last_timestamp)

On this dataset the output is the same, but this step may be necessary if other event types creep in.

Answer 1 (score: 0)
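One case the lag approach does not cover is two business events in a row for the same user, as with Mike's 333333 and 333334 in the question's 11-row table: the row before the second business event is itself a business event. A minimal base R sketch, assuming "before" means a strictly smaller timestamp within the same user_id, could look like the following. The helper name last_progress_before_business and the variable result are made up for illustration.

```r
# The 11-row table from the question, rebuilt with plain character columns
df <- data.frame(
  user_id = c("Rob", "Rob", "Rob", "Mike", "Mike", "Rob",
              "Mike", "Mike", "Lee", "Lee", "Mike"),
  event = c("business", "progress", "business", "progress", "progress", "progress",
            "business", "progress", "progress", "progress", "business"),
  timestamp = c(111111, 111112, 222222, 111111, 222222, 1,
                333333, 444444, 111111, 222222, 333334),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one user's rows, return the last progress row
# that precedes each business row (one output row per business event)
last_progress_before_business <- function(d) {
  d <- d[order(d$timestamp), ]
  prog <- d[d$event == "progress", ]
  bus <- d[d$event == "business", ]
  # For each business timestamp, count the progress rows strictly before it;
  # because prog is sorted, that count is the index of the last such row
  idx <- vapply(bus$timestamp, function(t) sum(prog$timestamp < t), integer(1))
  # Business events with no earlier progress event (count 0) are dropped
  prog[idx[idx > 0], ]
}

# Apply the helper per user and stack the pieces
result <- do.call(rbind, lapply(split(df, df$user_id), last_progress_before_business))
rownames(result) <- NULL
result
```

On the question's data this returns four rows: Mike's 222222 progress event twice (once per business event) and Rob's progress events at 1 and 111112, which are the rows the question asks for, though in a different order.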

df <-
structure(list(user_id = structure(c(3L, 3L, 3L, 2L, 2L, 3L, 2L, 
2L, 1L, 1L), .Label = c("Lee", "Mike", "Rob"), class = "factor"), 
event = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("business", 
"progress"), class = "factor"), timestamp = c(111111,111112, 222222, 
111111, 222222, 1, 333333, 444444, 111111, 222222)), .Names = c("user_id", 
"event", "timestamp"), row.names = c(NA, -10L), class = "data.frame")

#I want to know last progress event before every business event happens

new <- df[0,]  
for(i in 2:nrow(df)){
  if(df$event[i] == "business" && df$event[i-1] == "progress"){
   new <- rbind(new, df[i-1,]) 
  }
}  
new
  user_id    event timestamp
2     Rob progress    111112
6     Rob progress         1

Note that the result has only 2 rows, because business occurs only three times and its first occurrence is in the first row.
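The loop above compares each row with the one physically above it, so it depends on the row order and can pair rows from different users (row 6 is Rob's, while the business event in row 7 is Mike's). A possible variation, sketched here under the assumption that rows should be compared within each user in timestamp order, sorts first and then does the same adjacent-row check in vectorized form. The names sorted, prev, nxt and keep are illustrative only.

```r
# The same 10-row data as above, with plain character columns
df <- data.frame(
  user_id = c("Rob", "Rob", "Rob", "Mike", "Mike", "Rob",
              "Mike", "Mike", "Lee", "Lee"),
  event = c("business", "progress", "business", "progress", "progress", "progress",
            "business", "progress", "progress", "progress"),
  timestamp = c(111111, 111112, 222222, 111111, 222222, 1,
                333333, 444444, 111111, 222222),
  stringsAsFactors = FALSE
)

# Sort so that each user's rows are contiguous and in time order
sorted <- df[order(df$user_id, df$timestamp), ]

# Indices of each row (prev) and the row directly after it (nxt)
prev <- seq_len(nrow(sorted) - 1)
nxt <- prev + 1

# Keep a progress row when the next row is a business event of the same user
keep <- sorted$event[nxt] == "business" &
  sorted$event[prev] == "progress" &
  sorted$user_id[nxt] == sorted$user_id[prev]

new <- sorted[prev[keep], ]
new
```

With this data the sketch yields three rows (Mike 222222, Rob 1, Rob 111112), the same rows as the dplyr output in the first answer. Like that answer, it only looks at the immediately preceding row, so it still misses a business event that directly follows another business event.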