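Relevel a factor based on the values of another variable

Time: 2018-08-02 16:49:04

Tags: r tidyverse tidyr categorical-data forcats

I would like to relevel a factor variable based on the values of another variable. For example: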

factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"
), count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df", 
"tbl", "data.frame"))

> factors
# A tibble: 5 x 2
  color  count
  <chr>  <dbl>
1 RED        2
2 GREEN      5
3 BLUE      11
4 YELLOW     1
5 BROWN     19


This is what I would like to produce:

##Group all levels with count < 10 into "OTHER"

> factors.out
# A tibble: 3 x 2
  color count
  <chr> <dbl>
1 OTHER     8
2 BLUE     11
3 BROWN    19


I thought this would be a job for forcats::fct_lump():

##Keep 3 levels
factors %>%
  mutate(color = fct_lump(color, n = 3))
# A tibble: 5 x 2
  color  count
  <fct>  <dbl>
1 RED        2
2 GREEN      5
3 BLUE      11
4 YELLOW     1
5 BROWN     19


I know this can be done as follows:

factors %>%
  mutate(color = ifelse(count < 10, "OTHER", color)) %>%
  group_by(color) %>%
  summarise(count = sum(count))


But I thought, or hoped, that forcats had a convenience function for this.


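1 Answer:

Answer 0 (score: 2)

Since you already have a data.frame containing the factor and the counts, you can use the counts as weights when lumping the rarest observations together. The second stage is just collapsing the OTHER category, as in your example.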

factors <- structure(list(color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19)), row.names = c(NA, -5L), class = c("tbl_df", 
  "tbl", "data.frame"))

library("dplyr")
library("forcats")

factors.out <- factors %>%
  mutate(color = fct_lump(color, n = 2, other_level = "OTHER",
    w = count)) %>%
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)

which gives

factors.out 
# A tibble: 3 x 2
  color count
  <fct> <dbl>
1 OTHER     8
2 BLUE     11
3 BROWN    19
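
Note that n = 2 works here because the two levels to keep happen to be the two with the largest totals; it does not apply the count < 10 cutoff itself. As a hedged alternative, newer forcats releases (0.5.0 and later, as far as I know) provide fct_lump_min(), which lumps every level whose weighted total falls below min. A minimal sketch, assuming such a version is installed; the names threshold and factors.min are made up for illustration and do not come from the question:

```r
library("dplyr")
library("forcats")

# Same data as in the question, built as a plain data.frame
factors <- data.frame(
  color = c("RED", "GREEN", "BLUE", "YELLOW", "BROWN"),
  count = c(2, 5, 11, 1, 19),
  stringsAsFactors = FALSE
)

# Hypothetical name for the cutoff stated in the question (count < 10)
threshold <- 10

factors.min <- factors %>%
  # Levels whose summed weight (count) is below `threshold` become "OTHER"
  mutate(color = fct_lump_min(color, min = threshold, w = count,
    other_level = "OTHER")) %>%
  # Collapse the rows that now share the "OTHER" level
  group_by(color) %>%
  summarise(count = sum(count)) %>%
  arrange(count)

print(factors.min)
```

With the example data this should keep BLUE and BROWN and fold RED, GREEN and YELLOW into a single OTHER row. Unlike n = 2, the cutoff keeps working if more colors cross the threshold later.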