I would like to re-level a factor variable based on the values of another variable. For example:
factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"
), count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
> factors
# A tibble: 5 x 2
color count
<chr> <dbl>
1 RED 2
2 GREEN 5
3 BLUE 11
4 YELLOW 1
5 BROWN 19
This is what I would like to produce:
##Group all levels with count < 10 into "OTHER"
> factors.out
# A tibble: 3 x 2
color count
<chr> <dbl>
1 OTHER 8
2 BLUE 11
3 BROWN 19
I thought this would be a job for forcats::fct_lump(), but it leaves every level unchanged:
##Keep 3 levels
factors %>%
+ mutate(color = fct_lump(color, n = 3))
# A tibble: 5 x 2
color count
<fct> <dbl>
1 RED 2
2 GREEN 5
3 BLUE 11
4 YELLOW 1
5 BROWN 19
I know this can be done with:
factors %>%
mutate(color = ifelse(count < 10, "OTHER", color)) %>%
group_by(color) %>%
summarise(count = sum(count))
But I thought, or at least hoped, that forcats has a convenience function for this.
Answer 0 (score: 2)
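Because you already have a data.frame that contains the factor and its counts, you can use the counts as weights (the w argument) when lumping the rarest levels together. Without weights, fct_lump() only sees each color once, so every level is equally frequent and nothing gets lumped. The second stage just collapses the OTHER category by summing, as in your example.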
factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df",
"tbl", "data.frame"))
library("dplyr")
library("forcats")
factors.out <- factors %>%
mutate(color = fct_lump(color, n = 2, other_level = "OTHER",
w = count)) %>%
group_by(color) %>%
summarise(count = sum(count)) %>%
arrange(count)
which gives
factors.out
# A tibble: 3 x 2
color count
<fct> <dbl>
1 OTHER 8
2 BLUE 11
3 BROWN 19
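Note that n = 2 keeps the two heaviest levels rather than applying the count < 10 rule from the question; the two happen to coincide for this data. If the threshold itself is what matters, a minimal sketch using fct_lump_min() could look like the following. It assumes forcats 0.5.0 or later, where fct_lump_min() is available, and the name factors.min is only illustrative.

```r
library("dplyr")
library("forcats")

# Same data as in the question, rebuilt with tibble() (re-exported by dplyr)
factors <- tibble(
  color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)
)

# Lump every level whose weighted count is below 10 into "OTHER",
# then sum the counts within each resulting level
factors.min <- factors %>%
  mutate(color = fct_lump_min(color, min = 10, w = count,
                              other_level = "OTHER")) %>%
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)

factors.min
```

For this data the result should match factors.out above: RED, GREEN and YELLOW fall below the threshold and are summed into OTHER, while BLUE and BROWN are kept.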