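This example shows production output for several factories: the first column identifies the factory and the last column gives the quantity produced.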
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 C 20
9 D 5
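Now I want to collapse the factories into fewer levels based on each factory's total production in this dataset.
With plain forcats::fct_lump I can lump them by the number of rows in which each factory appears (for example, to end up with 3 levels):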
library(tidyverse)
df %>% mutate(factory=fct_lump(factory,2))
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 Other 20
9 Other 5
But I want to lump them based on sum(production) instead, keeping the top n = 2 factories (by total production) and merging the rest. Desired result:
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
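Any suggestions?
Thanks!
Answer 0 (score: 2)
The key here is to settle on a specific rule for grouping factories by their total production. Note that the right rule depends on the actual values in your (real) dataset.
Option 1
In this example, factories whose total production is 15 or less are grouped together. If you want a different grouping, change the threshold (for example, use 18 instead of 15).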
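To see why the choice of rule matters, it can help to look at the two quantities side by side. The sketch below uses only base R and the toy data from the question; the variable names totals and rows are just illustrative.

```r
# Same toy data as in the question
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Total production per factory: the quantity we want to lump by
totals <- tapply(df$production, df$factory, sum)

# Number of rows per factory: what the fct_lump() call in the question ranks by
rows <- table(df$factory)

sort(totals, decreasing = TRUE)
sort(rows, decreasing = TRUE)
```

By total production the top two factories are C (20) and A (17), whereas by row count they are B (5 rows) and A (2 rows), which is why the unweighted lumping keeps B instead of C.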
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
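Note that I create factory_new without removing the (original) factory column.
Option 2
In this example, you rank the factories by their total production, then pick how many of the top factories to keep as they are and group the rest together.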
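If you expect to try several thresholds, one possibility is to wrap Option 1 in a small function. This is only a sketch: lump_by_threshold is a hypothetical helper name, not something from dplyr, and it assumes the columns are called factory and production as in the question.

```r
library(dplyr)

factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Hypothetical helper: factories whose total production is at or below
# `threshold` are relabelled "Other"; the rest keep their own name
lump_by_threshold <- function(data, threshold) {
  data %>%
    group_by(factory) %>%
    mutate(factory_new = if (sum(production) > threshold) factory else "Other") %>%
    ungroup()
}

res15 <- lump_by_threshold(df, 15)  # A (17) and C (20) stay separate
res18 <- lump_by_threshold(df, 18)  # only C (20) stays separate

unique(res15$factory_new)
unique(res18$factory_new)
```

With a threshold of 18, factory A (total 17) falls into "Other" as well, so only C keeps its own label.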
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
# get factories ranked by total production, highest first
vec_top_factories <- df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory)
# input how many top factories you want to keep
# rest will be grouped together
n <- 2
# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
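The same ranking idea can also be written as a single pipeline, without the intermediate vector. This is an alternative sketch rather than the approach above: it ranks the per-factory totals with dplyr's dense_rank(), and the temporary column name total is just illustrative. One difference to be aware of: if two factories tie on total production at the cutoff, both are kept, so you can end up with more than n named factories.

```r
library(dplyr)

factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2  # how many top factories to keep

res <- df %>%
  group_by(factory) %>%
  mutate(total = sum(production)) %>%   # per-factory total, repeated on each row
  ungroup() %>%
  # rank the distinct totals from largest to smallest; rows of the same
  # factory share a total and therefore share a rank
  mutate(factory_new = if_else(dense_rank(desc(total)) <= n, factory, "Other")) %>%
  select(-total)

res
```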
Answer 1 (score: 0)
We can also do this in base R, using ave to build a logical condition:
df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
# as.character() guards against factory being stored as a factor
df$factory_new[i1] <- as.character(df$factory[i1])
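The snippet above applies a threshold. If you would rather keep the top n factories by total production, a base R sketch along the same lines could look like this (the names totals and top are just illustrative):

```r
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2  # how many top factories to keep

# total production per factory, as a named vector
totals <- tapply(df$production, df$factory, sum)

# names of the n factories with the largest totals
top <- names(sort(totals, decreasing = TRUE))[seq_len(n)]

# keep the name for top factories, relabel everything else
df$factory_new <- ifelse(df$factory %in% top, as.character(df$factory), "Other")
df
```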
Answer 2 (score: 0)
Just specify the weight argument w:
df %>%
  mutate(factory = fct_lump_n(factory, 2, w = production))
factory production
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Note: use forcats::fct_lump_n, because the generic fct_lump is no longer recommended.
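The fct_lump_* family also has a threshold-based member, fct_lump_min, which accepts the same w argument and so gives a weighted counterpart to the threshold rule from the first answer. The sketch below assumes forcats 0.5.0 or later; the label "Rest" and the cutoff of 16 are arbitrary choices for illustration.

```r
library(forcats)

factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)

# Keep the top 2 factories by total production and name the lumped level "Rest"
top2 <- fct_lump_n(factory, n = 2, w = production, other_level = "Rest")

# Threshold variant: keep factories whose weighted total reaches 16
by_min <- fct_lump_min(factory, min = 16, w = production)

as.character(top2)
as.character(by_min)
```

On this data both calls should keep A and C and lump B and D together, since only A (17) and C (20) have totals of 16 or more.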