Question

我正在尝试对课程中学员的当前状态进行分类。这是这篇文章的扩展：purrr pmap to read max column name by column name number。我的数据集如下：

library(dplyr)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                   status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                   status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                   status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                   status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))

我想对人们的当前状态进行分类。如果某人以任何状态完成了课程，则其状态为“已完成”。但是，棘手的是它们的注册状态。如果某人的最终状态为已注册，则该人为“已注册”；如果后来的状态为“ NA”，则该人为“已注册”。如果取消或取消注册后的状态，则不会注册。因此，最终数据集应如下所示：

library(dplyr)
solution <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                   status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                   status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                   status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                   status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                   current = c("Not Taken", "Registered", "Completed", "Registered", "Registered", "Not Taken"))

未接受安格拉（Angela）是因为她在注册后退出了。克莱尔之所以注册，是因为尽管她过去曾提款，但她最近才注册。贾斯汀完成了，因为他以任何身份完成了课程。鲍勃因为尚未退出或取消课程而被注册。与克莱尔类似，约瑟夫注册的时间比他退出的时间更近，因此他已注册。最后，吉尔被“未接受”，因为他的课程被取消了，并且他没有新的注册。

这是我的代码：

library(tidyverse)
solution %>% 
  mutate(
    test =
      pmap_chr(select(., contains("status")), ~
        case_when(
          any(str_detect(c(...), "(?i)Completed")) ~ "Completed",
          any(str_detect(c(...), "(?i)Exempt")) | any(str_detect(c(...), "(?i)Incomplete")) ~ "Exclude",
          length(c(...) == "Registered") > length(c(...) == "Withdrawn") | length(c(...) == "Registered") > length(c(...) == "Cancelled")  ~ "Registered",
          any(str_detect(c(...), "(?i)No Show")) | any(str_detect(c(...), "(?i)Denied")) | any(str_detect(c(...), "(?i)Cancelled")) | any(str_detect(c(...), "(?i)Waitlist Expired")) || any(str_detect(c(...), "(?i)Withdrawn")) ~ "Not Taken",
          TRUE ~ "NA"
        )
      )
  )

我不知道如何用注册部分破解代码。理想情况下，我想保留尽可能多的代码，因为我的真实数据集具有许多状态列。

谢谢！

Answer 1

我认为，如果您首先根据"tidy data" principles稍微重新排列数据，每次学生的状态更改都记录在单独的一行中，则此问题更容易解决。重新排列的数据在problem_wrangled中。

然后，通常只能通过查看最新状态来确定当前状态，但“已完成”状态除外，我们将检查所有过去状态。

library(tidyverse)

problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                  status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                  status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                  status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                  status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))

status_wrangled <- problem %>%
  gather(key = "time", value = "status", starts_with("status")) %>%
  mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
  arrange(name, time) %>%
  filter(status != "NA")

head(status_wrangled)
#> # A tibble: 6 x 3
#>   name   time  status    
#>   <chr>  <chr> <chr>     
#> 1 Angela 1     Registered
#> 2 Angela 2     Withdrawn 
#> 3 Angela 4     Withdrawn 
#> 4 Bob    1     Registered
#> 5 Claire 1     Withdrawn 
#> 6 Claire 2     Withdrawn

status_current <- status_wrangled %>%
  group_by(name) %>%
  summarize(
    current = case_when(
      # Has student completed at any time?
      "Completed" %in% status ~ "Completed",
      # Examine last recorded status
      tail(status, 1) %in% c("Exempt", "Incomplete") ~ "Exclude",
      tail(status, 1) %in% c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired") ~ "Not Taken",
      tail(status, 1) == "Registered" ~ "Registered",
      TRUE ~ "Unknown"
    )
  )

print(status_current, n = Inf)
#> # A tibble: 6 x 2
#>   name   current   
#>   <chr>  <chr>     
#> 1 Angela Not Taken 
#> 2 Bob    Registered
#> 3 Claire Registered
#> 4 Gil    Not Taken 
#> 5 Joseph Registered
#> 6 Justin Completed

^{由reprex package（v0.3.0）于2019-06-17创建}

编辑：关于您进行近似匹配的评论：我修改了示例，以允许近似匹配达到一定的编辑距离。您可能想对其进行调整，但是允许最多进行三个左右的编辑似乎是合理的。但是请注意，“未完成”和“已完成”之间的区别只是四个修改。

library(tidyverse)

problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
                  status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
                  status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
                  status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
                  status_4 = c("Withdrawnn", "Registered", "Withdrawn", "NA", "Registered", "NA"))

status_wrangled <- problem %>%
  gather(key = "time", value = "status", starts_with("status")) %>%
  mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
  arrange(name, time) %>%
  filter(status != "NA")

# Find if input vector matches to *any* given pattern below the specified edit distance
any_fuzzy_match <- function(x, patterns, max.distance = 3) {
  matches <- map(paste0("^", patterns, "$"), agrepl, x = x, max.distance = max.distance, fixed = FALSE)
  reduce(matches, `|`)
}

status_current <- status_wrangled %>%
  group_by(name) %>%
  summarize(
    current = case_when(
      # Has student completed at any time?
      any(any_fuzzy_match(status, "Completed")) ~ "Completed",
      # Examine last recorded status
      any_fuzzy_match(tail(status, 1), c("Exempt", "Incomplete")) ~ "Exclude",
      any_fuzzy_match(tail(status, 1), c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired")) ~ "Not Taken",
      any_fuzzy_match(tail(status, 1), "Registered") ~ "Registered",
      TRUE ~ "Unknown"
    )
  )

print(status_current, n = Inf)
#> # A tibble: 6 x 2
#>   name   current   
#>   <chr>  <chr>     
#> 1 Angela Not Taken 
#> 2 Bob    Registered
#> 3 Claire Registered
#> 4 Gil    Not Taken 
#> 5 Joseph Registered
#> 6 Justin Completed

^{由reprex package（v0.3.0）于2019-06-18创建}

case_when有条件地使用purrr读取列序列

1 个答案: