Assign a variable based on start/end patterns

Time: 2018-07-12 01:42:24

Tags: r dataframe dplyr grouping data-manipulation

I have this dataset:

a <- data.frame("session_id" = c(rep(1,10), rep(2,7), rep(3,2)),
                "content" = c("A", "B", "C","open", "A", "J", "M", "K","exit", "D", 
                "open", "U", "T","quit", "I", "M" , "A", "Q", "M" ), 
            "type" = c("non-edit", "non-edit", "non-edit", "edit", "edit", "edit", 
            "edit", "edit", "edit", "non-edit", "edit", "edit", "edit", 
            "edit", "non-edit", "non-edit", "non-edit", "non-edit", "non-edit"))

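I want to set the type column to "non-edit" or "edit" based on the content column. From the row where "open" is detected in content through the row with "quit" or "exit", the type should be "edit". The type column in the example above shows the expected result.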

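5 answers:

Answer 0 (score: 3)

We create a new column (new_type) and initialize its values to "non-edit". Then we find the indices where "open" and "quit"/"exit" occur, use mapply to build the sequence of indices between each pair, and replace the corresponding values with "edit".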

a$new_type <- "non-edit"
open_ind <- which(a$content == "open")
close_ind <- which(a$content %in% c("quit", "exit"))
a$new_type[unlist(mapply(":", open_ind, close_ind))] <- "edit"


a
#   session_id content     type new_type
#1           1       A non-edit non-edit
#2           1       B non-edit non-edit
#3           1       C non-edit non-edit
#4           1    open     edit     edit
#5           1       A     edit     edit
#6           1       J     edit     edit
#7           1       M     edit     edit
#8           1       K     edit     edit
#9           1    exit     edit     edit
#10          1       D non-edit non-edit
#11          2    open     edit     edit
#12          2       U     edit     edit
#13          2       T     edit     edit
#14          2    quit     edit     edit
#15          2       I non-edit non-edit
#16          2       M non-edit non-edit
#17          2       A non-edit non-edit
#18          3       Q non-edit non-edit
#19          3       M non-edit non-edit

To understand the steps:

open_ind
#[1]  4 11
close_ind
#[1]  9 14
unlist(mapply(":", open_ind, close_ind))
#[1]  4  5  6  7  8  9 11 12 13 14
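
The mapply call pairs the i-th "open" with the i-th "quit"/"exit", so it relies on every "open" having a matching close later in the data. Below is a minimal sketch of a guarded version, wrapped in a hypothetical helper mark_edit() (the name and the sanity checks are assumptions added for illustration, not part of the answer):

```r
# Hypothetical helper: returns "edit"/"non-edit" labels for a content vector.
mark_edit <- function(content) {
  content <- as.character(content)
  open_ind  <- which(content == "open")
  close_ind <- which(content %in% c("quit", "exit"))
  # Basic sanity checks: same number of opens and closes,
  # and each open comes before the close it is paired with.
  stopifnot(length(open_ind) == length(close_ind),
            all(open_ind < close_ind))
  new_type <- rep("non-edit", length(content))
  # Map() always returns a list, so unlist() yields a plain index vector.
  new_type[unlist(Map(":", open_ind, close_ind))] <- "edit"
  new_type
}

mark_edit(c("A", "open", "B", "exit", "C"))
#[1] "non-edit" "edit"     "edit"     "edit"     "non-edit"
```

With the data from the question, a$new_type <- mark_edit(a$content) should give the same column as above, but it stops with an error instead of silently mislabelling rows when the counts do not match.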

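Answer 1 (score: 2)

After grouping by "session_id", create a second grouping variable by taking the cumulative sum of a logical expression, and use it to assign the values "edit" and "non-edit".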

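library(dplyr)
a %>% 
  group_by(session_id) %>% 
  # Start a new run at each "open" and on the row right after an "exit"/"quit".
  # `.add` replaces the deprecated `add` argument (dplyr >= 1.0.0).
  group_by(grp = cumsum((content == "open") |
     lag(content %in% c("exit", "quit"), default = FALSE)), .add = TRUE) %>%
  # A run that contains any boundary key is an "edit" run.
  mutate(type1 = case_when(any(content %in% c("open", "exit", "quit")) ~ "edit", 
                         TRUE ~ "non-edit")) %>%
  ungroup() %>%
  select(-grp)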
# A tibble: 19 x 4
#   session_id content type     type1   
#        <dbl> <fct>   <fct>    <chr>   
# 1          1 A       non-edit non-edit
# 2          1 B       non-edit non-edit
# 3          1 C       non-edit non-edit
# 4          1 open    edit     edit    
# 5          1 A       edit     edit    
# 6          1 J       edit     edit    
# 7          1 M       edit     edit    
# 8          1 K       edit     edit    
# 9          1 exit    edit     edit    
#10          1 D       non-edit non-edit
#11          2 open    edit     edit    
#12          2 U       edit     edit    
#13          2 T       edit     edit    
#14          2 quit    edit     edit    
#15          2 I       non-edit non-edit
#16          2 M       non-edit non-edit
#17          2 A       non-edit non-edit
#18          3 Q       non-edit non-edit
#19          3 M       non-edit non-edit
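
To see why the cumulative sum produces the right groups, it can help to look at the run id on its own. The sketch below rebuilds it for the first session's content in base R (used here only to keep the example dependency-free; starts_run, grp and has_key are illustrative names, not part of the answer):

```r
content <- c("A", "B", "C", "open", "A", "J", "M", "K", "exit", "D")

is_open  <- content == "open"
is_close <- content %in% c("exit", "quit")
# A new run starts at each "open" and on the row right after a close
# (the shifted vector plays the role of lag(..., default = FALSE)).
starts_run <- is_open | c(FALSE, head(is_close, -1))
grp <- cumsum(starts_run)
grp
#[1] 0 0 0 1 1 1 1 1 1 2

# A run is "edit" when it contains any of the boundary keys.
has_key <- content %in% c("open", "exit", "quit")
type1 <- ifelse(ave(has_key, grp, FUN = any), "edit", "non-edit")
data.frame(content, grp, type1)
```

Run 1 spans "open" through "exit", so every row in it is labelled "edit"; runs 0 and 2 contain no boundary key and stay "non-edit".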

Answer 2 (score: 1)

Here is a pipeline that does not require grouping.

library(dplyr)
library(tidyr)

b <- 
    a %>% 
    # 1. Mark the boundaries of the 'edit' regions.
    mutate(type = case_when(content == "open"           ~ "edit", 
                            grepl("exit|quit", content) ~ "non-edit",
                                                   TRUE ~ NA_character_)) %>%
    # 2. Fill the NAs with the last good value. 'open' down to 'exit/quit'
    #    will be filled with 'edit'.
    tidyr::fill(type) %>%
    # 3. Replace unfilled NAs, like at the top of the table.
    replace_na(list(type = "non-edit")) %>%
    # 4. Rename the exit/quit boundary.
    mutate(type = ifelse(grepl("exit|quit", content), "edit", type))

b

#>    session_id content     type
#> 1           1       A non-edit
#> 2           1       B non-edit
#> 3           1       C non-edit
#> 4           1    open     edit
#> 5           1       A     edit
#> 6           1       J     edit
#> 7           1       M     edit
#> 8           1       K     edit
#> 9           1    exit     edit
#> 10          1       D non-edit
#> 11          2    open     edit
#> 12          2       U     edit
#> 13          2       T     edit
#> 14          2    quit     edit
#> 15          2       I non-edit
#> 16          2       M non-edit
#> 17          2       A non-edit
#> 18          3       Q non-edit
#> 19          3       M non-edit

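Answer 3 (score: 0)

Plan: step through the content column looking for "transition keys". If the key is "open", the change takes effect on that row; if the key is "quit" or "exit", it takes effect on the next row. Consider the following code to implement this: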

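last <- "exit"                      # last transition key seen; start outside an edit block
keys <- c("open", "exit", "quit")   # transition keys
content <- as.character(a$content)  # guard against factor columns (the default before R 4.0)
a$type <- as.character(a$type)
for (i in seq_len(nrow(a))) {
  is_key <- content[i] %in% keys
  # Key rows are always "edit"; other rows are "edit" only inside an open block.
  a$type[i] <- if (is_key || last == "open") "edit" else "non-edit"
  if (is_key) last <- content[i]
}
a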
   session_id content     type
1           1       A non-edit
2           1       B non-edit
3           1       C non-edit
4           1    open     edit
5           1       A     edit
6           1       J     edit
7           1       M     edit
8           1       K     edit
9           1    exit     edit
10          1       D non-edit
11          2    open     edit
12          2       U     edit
13          2       T     edit
14          2    quit     edit
15          2       I non-edit
16          2       M non-edit
17          2       A non-edit
18          3       Q non-edit
19          3       M non-edit

Answer 4 (score: 0)

Here is one way in base R using cumsum:

a$new_type <- c("non-edit","edit")[
  cumsum(a$content=="open") - c(0,head(cumsum(a$content %in% c("exit","quit")),-1)) +1]
#    session_id content     type new_type
# 1           1       A non-edit non-edit
# 2           1       B non-edit non-edit
# 3           1       C non-edit non-edit
# 4           1    open     edit     edit
# 5           1       A     edit     edit
# 6           1       J     edit     edit
# 7           1       M     edit     edit
# 8           1       K     edit     edit
# 9           1    exit     edit     edit
# 10          1       D non-edit non-edit
# 11          2    open     edit     edit
# 12          2       U     edit     edit
# 13          2       T     edit     edit
# 14          2    quit     edit     edit
# 15          2       I non-edit non-edit
# 16          2       M non-edit non-edit
# 17          2       A non-edit non-edit
# 18          3       Q non-edit non-edit
# 19          3       M non-edit non-edit
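
The one-liner is dense, so here is a step-by-step sketch of the same arithmetic on a short vector (the intermediate names opened, closed_before and inside are introduced for illustration only):

```r
content <- c("A", "open", "B", "quit", "C")

# Number of "open" rows seen up to and including each row.
opened <- cumsum(content == "open")
# Number of "exit"/"quit" rows seen strictly before each row (shifted by one),
# so the closing row itself still counts as inside the block.
closed_before <- c(0, head(cumsum(content %in% c("exit", "quit")), -1))
# 1 inside an open block, 0 outside.
inside <- opened - closed_before
inside
#[1] 0 1 1 1 0

# Adding 1 turns the 0/1 flag into an index into c("non-edit", "edit").
c("non-edit", "edit")[inside + 1]
#[1] "non-edit" "edit"     "edit"     "edit"     "non-edit"
```

This only works when blocks are not nested: a second "open" before the first block closes would push the index to 3 and yield NA.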