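When doing data analysis, I sometimes need to recode values into a factor for group analysis. I want the factor levels to keep the same order as the conditions specified in case_when. In this case, the order should be "Excellent" "Good" "Fail". How can I achieve this without tediously repeating the labels in levels=c('Excellent', 'Good', 'Fail')?
Thank you very much.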
library(dplyr, warn.conflicts = FALSE)
set.seed(1234)
score <- runif(100, min = 0, max = 100)
Performance <- function(x) {
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ 'Excellent',
    x > 50 ~ 'Good',
    TRUE ~ 'Fail'
  ) %>% factor(levels=c('Excellent', 'Good', 'Fail'))
}
performance <- Performance(score)
levels(performance)
#> [1] "Excellent" "Good" "Fail"
table(performance)
#> performance
#> Excellent Good Fail
#> 15 30 55
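In the end, I came up with a solution myself. For those who are interested, here it is. I wrote a function fct_case_when (pretending to be a function in forcats). It is just a wrapper around case_when that returns a factor. The order of the levels is the same as the order of the arguments.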
fct_case_when <- function(...) {
  args <- as.list(match.call())
  levels <- sapply(args[-1], function(f) f[[3]])  # extract RHS of formula
  levels <- levels[!is.na(levels)]
  factor(dplyr::case_when(...), levels=levels)
}
Now I can use fct_case_when in place of case_when, and the result is the same as with the previous implementation (but less tedious).
Performance <- function(x) {
  fct_case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ 'Excellent',
    x > 50 ~ 'Good',
    TRUE ~ 'Fail'
  )
}
performance <- Performance(score)
levels(performance)
#> [1] "Excellent" "Good" "Fail"
table(performance)
#> performance
#> Excellent Good Fail
#> 15 30 55
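One caveat about the wrapper above: it reads the right-hand side of each formula as unevaluated code, so it relies on every label being written as a literal string. The following is a minimal sketch of a variant that evaluates each right-hand side instead, so that labels stored in a variable also work. The name fct_case_when2 is hypothetical, and the sketch assumes every argument is a two-sided formula.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant (not part of forcats or dplyr): evaluate each formula's
# right-hand side in the environment where the formula was written.
fct_case_when2 <- function(...) {
  formulas <- list(...)
  rhs <- lapply(formulas, function(f) eval(f[[3]], environment(f)))
  levels <- unique(as.character(unlist(rhs)))
  levels <- levels[!is.na(levels)]  # NA is a value, not a level
  factor(dplyr::case_when(...), levels = levels)
}

labels <- c("Excellent", "Good", "Fail")
x <- c(95, 60, 10, NA)
res <- fct_case_when2(
  is.na(x) ~ NA_character_,
  x > 80 ~ labels[1],
  x > 50 ~ labels[2],
  TRUE ~ labels[3]
)
levels(res)  # levels follow the argument order: Excellent, Good, Fail
```

Evaluating the right-hand sides costs a little extra work, but it avoids surprises when a label is anything other than a quoted string.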
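Answer 0 (score: 2)
By default, levels are set in lexicographic order. If you don't want to specify them, you can either set things up so that the lexicographic order is the one you want (Performance1), or create the levels vector once and use it both when generating the labels and when setting the factor levels (Performance2). I don't know how much effort or tedium these will save you, but here they are. Also take a look at my third suggestion, which I think is the least tedious way.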
Performance1 <- function(x) {
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ 'Excellent',
    x <= 50 ~ 'Fail',
    TRUE ~ 'Good'
  ) %>% factor()
}
Performance2 <- function(x, levels = c("Excellent", "Good", "Fail")){
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ levels[1],
    x > 50 ~ levels[2],
    TRUE ~ levels[3]
  ) %>% factor(levels)
}
performance1 <- Performance1(score)
levels(performance1)
# [1] "Excellent" "Fail" "Good"
table(performance1)
# performance1
# Excellent Fail Good
# 15 55 30
performance2 <- Performance2(score)
levels(performance2)
# [1] "Excellent" "Good" "Fail"
table(performance2)
# performance2
# Excellent Good Fail
# 15 30 55
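The default ordering mentioned at the start of this answer can be checked in isolation with base R. This is only an illustrative sketch; the labels vector is made up for the example.

```r
labels <- c("Good", "Fail", "Excellent", "Good")

# Without `levels`, factor() sorts the unique values.
default_levels <- levels(factor(labels))
default_levels  # "Excellent" "Fail" "Good"

# Passing `levels` explicitly keeps whatever order you supply,
# here the order of first appearance.
explicit_levels <- levels(factor(labels, levels = unique(labels)))
explicit_levels  # "Good" "Fail" "Excellent"
```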
If I may suggest a less tedious way:
performance <- cut(score, breaks = c(0, 50, 80, 100),
labels = c("Fail", "Good", "Excellent"))
levels(performance)
# [1] "Fail" "Good" "Excellent"
table(performance)
# performance
# Fail Good Excellent
# 55 30 15
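Two details of the cut() call are worth checking. Its intervals are right-closed by default, so (50, 80] agrees with the x > 50 and x > 80 rules, but a score of exactly 0 falls outside (0, 50] and becomes NA. The levels also come out in ascending order rather than "Excellent" first. The sketch below assumes that open-ended breaks are acceptable and reverses the levels afterwards; the small score vector is made up to hit the boundaries.

```r
score <- c(0, 50, 50.5, 80, 80.1, 100, NA)

# Open-ended breaks so that 0 (or any out-of-range score) is still classified.
performance <- cut(score, breaks = c(-Inf, 50, 80, Inf),
                   labels = c("Fail", "Good", "Excellent"))

# Reverse the level order without changing the values themselves.
performance <- factor(performance, levels = rev(levels(performance)))

levels(performance)        # "Excellent" "Good" "Fail"
as.character(performance)  # Fail, Fail, Good, Good, Excellent, Excellent, NA
```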
Answer 1 (score: 1)
Although my solution replaces your pipe with a messy intermediate variable, this works:
library(dplyr, warn.conflicts = FALSE)
set.seed(1234)
score <- runif(100, min = 0, max = 100)
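Performance <- function(x) {
  t <- case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ 'Excellent',
    x > 50 ~ 'Good',
    TRUE ~ 'Fail'
  )
  # unique labels, in order of first appearance
  to <- subset(t, !duplicated(t))
  # order those labels by the score at which each first appeared, highest first
  factor(t, levels = to[order(subset(x, !duplicated(t)), decreasing = TRUE)])
}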
performance <- Performance(score)
levels(performance)
Edit: fixed!
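The idea in this answer is to derive the level order from the data itself, ranking each label by the score it corresponds to. A shorter sketch of the same idea is shown below. Like the code above, it only makes sense when the labels are monotone in the score, and it only produces levels that actually occur in the data. The function name performance_by_score is hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

performance_by_score <- function(x) {
  label <- case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ "Excellent",
    x > 50 ~ "Good",
    TRUE ~ "Fail"
  )
  # Sort labels from highest to lowest score; unique() keeps first occurrences.
  ordered_labels <- unique(label[order(x, decreasing = TRUE)])
  factor(label, levels = ordered_labels[!is.na(ordered_labels)])
}

res <- performance_by_score(c(10, 95, NA, 60, 85))
levels(res)  # "Excellent" "Good" "Fail"
```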