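Lumping factor levels based on another column

Time: 2018-10-04 14:53:51

Tags: r tidyverse forcats

This example shows production output measurements for different factories. The first column identifies the factory and the last column gives the amount produced.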

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
  factory production
1       A         15
2       A          2
3       B          1
4       B          1
5       B          2
6       B          1
7       B          2
8       C         20
9       D          5

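Now I want to lump the factories into fewer levels based on their total production in this dataset.

With plain forcats::fct_lump, I can lump them by the number of rows in which each factory appears (e.g. to get 3 levels):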

library(tidyverse)    
df %>% mutate(factory=fct_lump(factory,2))
      factory production
    1       A         15
    2       A          2
    3       B          1
    4       B          1
    5       B          2
    6       B          1
    7       B          2
    8   Other         20
    9   Other          5

But I want to lump them based on sum(production) instead, keeping the top n = 2 factories (by total production) and lumping the rest. Desired result:

1       A         15
2       A          2
3   Other          1
4   Other          1
5   Other          2
6   Other          1
7   Other          2
8       C         20
9   Other          5

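Any suggestions?

Thanks!

3 answers:

Answer 0 (score: 2)

The key here is to decide on a rule for grouping factories together based on their total production. Note that the right rule depends on the actual values in your (real) dataset.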

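Option 1

In this example, factories with a total production of 15 or less are grouped together. If you want a different grouping, change the threshold (e.g. use 18 instead of 15).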

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other 

Note that I kept the (original) factory column when creating factory_new.
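
To choose a sensible threshold, it can help to look at the per-factory totals first. The following is a minimal base R sketch on the same example data; the variable name `totals` is only illustrative and is not part of the answer above.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Sum production within each factory
totals <- tapply(df$production, df$factory, sum)

# Order the totals from largest to smallest
totals <- totals[order(totals, decreasing = TRUE)]
totals
# C = 20, A = 17, B = 7, D = 5
```

With these totals, a threshold of 15 keeps A and C, whereas a threshold of 18 would keep only C.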

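Option 2

In this example, you rank the factories by their total production, then choose how many of the top factories to keep as they are and group the rest together.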

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

# get ranked factories based on sum production
df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory) -> vec_top_factories

# input how many top factories you want to keep
# rest will be grouped together
n <- 2

# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other 

Answer 1 (score: 0)

We can also do this in base R, using ave to build a logical condition:

df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
# as.character() guards against factory being a factor (the default in R < 4.0)
df$factory_new[i1] <- as.character(df$factory[i1])
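
The ave approach above uses a fixed threshold. If the goal is instead to keep the top n factories by total production, as the question asks, a base R sketch might look like the following. The names `totals` and `top` are illustrative and do not come from the answer.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2  # number of top factories to keep

# Total production per factory
totals <- tapply(df$production, df$factory, sum)

# Names of the n factories with the largest totals
top <- head(names(totals)[order(totals, decreasing = TRUE)], n)

# Keep the top factories, relabel everything else as "Other"
df$factory_new <- ifelse(df$factory %in% top, df$factory, "Other")
df$factory_new
# "A" "A" "Other" "Other" "Other" "Other" "Other" "C" "Other"
```

Using head() rather than indexing with 1:n avoids NA entries when n is larger than the number of factories.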

Answer 2 (score: 0)

Just specify the weights argument w:

df %>%
  mutate(factory = fct_lump_n(factory, 2, w = production))
  factory production
1       A         15
2       A          2
3   Other          1
4   Other          1
5   Other          2
6   Other          1
7   Other          2
8       C         20
9   Other          5

Note: forcats::fct_lump_n is used here because the generic fct_lump is no longer recommended.
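
As a quick sanity check on the weighted lumping, one can inspect the resulting levels and the production total that ends up in each level. This is a minimal sketch assuming forcats 0.5.0 or later (where fct_lump_n is available); the variable name `f` is illustrative.

```r
library(forcats)

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)

# Keep the 2 factories with the largest summed weight, lump the rest
f <- fct_lump_n(factory, n = 2, w = production)

levels(f)
# "A" "C" "Other"

# Total production per resulting level: A = 17, C = 20, Other = 12
tapply(production, f, sum)
```

The "Other" level collects factories B and D, whose totals (7 and 5) are the two smallest.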