我正在尝试对课程中学员的当前状态进行分类。这是这篇文章的扩展:purrr pmap to read max column name by column name number。我的数据集如下:
library(dplyr)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))
我想对人们的当前状态进行分类。如果某人以任何状态完成了课程,则其状态为“已完成”。但是,棘手的是它们的注册状态。如果某人的最终状态为已注册,则该人为“已注册”;如果后来的状态为“ NA”,则该人为“已注册”。如果取消或取消注册后的状态,则不会注册。因此,最终数据集应如下所示:
library(dplyr)
solution <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"),
current = c("Not Taken", "Registered", "Completed", "Registered", "Registered", "Not Taken"))
未接受安格拉(Angela)是因为她在注册后退出了。克莱尔之所以注册,是因为尽管她过去曾提款,但她最近才注册。贾斯汀完成了,因为他以任何身份完成了课程。鲍勃因为尚未退出或取消课程而被注册。与克莱尔类似,约瑟夫注册的时间比他退出的时间更近,因此他已注册。最后,吉尔被“未接受”,因为他的课程被取消了,并且他没有新的注册。
这是我的代码:
library(tidyverse)
solution %>%
mutate(
test =
pmap_chr(select(., contains("status")), ~
case_when(
any(str_detect(c(...), "(?i)Completed")) ~ "Completed",
any(str_detect(c(...), "(?i)Exempt")) | any(str_detect(c(...), "(?i)Incomplete")) ~ "Exclude",
length(c(...) == "Registered") > length(c(...) == "Withdrawn") | length(c(...) == "Registered") > length(c(...) == "Cancelled") ~ "Registered",
any(str_detect(c(...), "(?i)No Show")) | any(str_detect(c(...), "(?i)Denied")) | any(str_detect(c(...), "(?i)Cancelled")) | any(str_detect(c(...), "(?i)Waitlist Expired")) || any(str_detect(c(...), "(?i)Withdrawn")) ~ "Not Taken",
TRUE ~ "NA"
)
)
)
我不知道如何用注册部分破解代码。理想情况下,我想保留尽可能多的代码,因为我的真实数据集具有许多状态列。
谢谢!
答案 0 :(得分:1)
我认为,如果您首先根据"tidy data" principles稍微重新排列数据,每次学生的状态更改都记录在单独的一行中,则此问题更容易解决。重新排列的数据在problem_wrangled
中。
然后,通常只能通过查看最新状态来确定当前状态,但“已完成”状态除外,我们将检查所有过去状态。
library(tidyverse)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawn", "Registered", "Withdrawn", "NA", "Registered", "NA"))
status_wrangled <- problem %>%
gather(key = "time", value = "status", starts_with("status")) %>%
mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
arrange(name, time) %>%
filter(status != "NA")
head(status_wrangled)
#> # A tibble: 6 x 3
#> name time status
#> <chr> <chr> <chr>
#> 1 Angela 1 Registered
#> 2 Angela 2 Withdrawn
#> 3 Angela 4 Withdrawn
#> 4 Bob 1 Registered
#> 5 Claire 1 Withdrawn
#> 6 Claire 2 Withdrawn
status_current <- status_wrangled %>%
group_by(name) %>%
summarize(
current = case_when(
# Has student completed at any time?
"Completed" %in% status ~ "Completed",
# Examine last recorded status
tail(status, 1) %in% c("Exempt", "Incomplete") ~ "Exclude",
tail(status, 1) %in% c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired") ~ "Not Taken",
tail(status, 1) == "Registered" ~ "Registered",
TRUE ~ "Unknown"
)
)
print(status_current, n = Inf)
#> # A tibble: 6 x 2
#> name current
#> <chr> <chr>
#> 1 Angela Not Taken
#> 2 Bob Registered
#> 3 Claire Registered
#> 4 Gil Not Taken
#> 5 Joseph Registered
#> 6 Justin Completed
由reprex package(v0.3.0)于2019-06-17创建
编辑:关于您进行近似匹配的评论:我修改了示例,以允许近似匹配达到一定的编辑距离。您可能想对其进行调整,但是允许最多进行三个左右的编辑似乎是合理的。但是请注意,“未完成”和“已完成”之间的区别只是四个修改。
library(tidyverse)
problem <- tibble(name = c("Angela", "Claire", "Justin", "Bob", "Joseph", "Gil"),
status_1 = c("Registered", "Withdrawn", "Completed", "Registered", "Registered", "Registered"),
status_2 = c("Withdrawn", "Withdrawn", "Registered", "NA", "Withdrawn", "Cancelled"),
status_3 = c("NA", "Registered", "Withdrawn", "NA", "Registered", "NA"),
status_4 = c("Withdrawnn", "Registered", "Withdrawn", "NA", "Registered", "NA"))
status_wrangled <- problem %>%
gather(key = "time", value = "status", starts_with("status")) %>%
mutate(time = as.integer(str_split_fixed(time, "_", 2)[, 2])) %>%
arrange(name, time) %>%
filter(status != "NA")
# Find if input vector matches to *any* given pattern below the specified edit distance
any_fuzzy_match <- function(x, patterns, max.distance = 3) {
matches <- map(paste0("^", patterns, "$"), agrepl, x = x, max.distance = max.distance, fixed = FALSE)
reduce(matches, `|`)
}
status_current <- status_wrangled %>%
group_by(name) %>%
summarize(
current = case_when(
# Has student completed at any time?
any(any_fuzzy_match(status, "Completed")) ~ "Completed",
# Examine last recorded status
any_fuzzy_match(tail(status, 1), c("Exempt", "Incomplete")) ~ "Exclude",
any_fuzzy_match(tail(status, 1), c("Withdrawn", "Cancelled", "No Show", "Denied", "Waitlist Expired")) ~ "Not Taken",
any_fuzzy_match(tail(status, 1), "Registered") ~ "Registered",
TRUE ~ "Unknown"
)
)
print(status_current, n = Inf)
#> # A tibble: 6 x 2
#> name current
#> <chr> <chr>
#> 1 Angela Not Taken
#> 2 Bob Registered
#> 3 Claire Registered
#> 4 Gil Not Taken
#> 5 Joseph Registered
#> 6 Justin Completed
由reprex package(v0.3.0)于2019-06-18创建