Eliminating factor levels with little influence

Time: 2017-08-30 07:54:10

Tags: r data-science

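A column has hundreds of levels, and not all of them really add value. For example, about 60% of the levels account for less than 80% of the rows (they appear no more than once in the data frame) and are not expected to affect the result. The goal is to eliminate the levels that do not make it into that 80%. Can anyone help? Thanks in advance.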

1 Answer:

Answer 0 (score: 1)

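Here is a simple process that finds the values whose cumulative share of the dataset (rows) falls beyond 80% and groups them together under a new value. The process uses a character column rather than a factor column.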
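The character-column requirement matters because `ifelse()` drops factor attributes. The following is a minimal sketch of what that looks like, using a made-up factor `f` that is not part of the answer's data; if your column is a factor, converting it with `as.character()` first should avoid the problem.

```r
# Hypothetical factor column, for illustration only
f <- factor(c("A", "A", "B", "c"))

# ifelse() on a factor returns the underlying integer codes as text
bad <- ifelse(f %in% "c", "Rest", f)
bad
# "1" "1" "2" "Rest"

# converting to character first keeps the original labels
good <- ifelse(f %in% "c", "Rest", as.character(f))
good
# "A" "A" "B" "Rest"
```

In a data frame, the same conversion could be done once up front, for example with `dt$type <- as.character(dt$type)`.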

library(dplyr)

# example dataset
dt <- data.frame(type = c("A","A","A","B","B","B","c","D"),
                 value = 1:8, stringsAsFactors = FALSE)

dt

#   type value
# 1    A     1
# 2    A     2
# 3    A     3
# 4    B     4
# 5    B     5
# 6    B     6
# 7    c     7
# 8    D     8

# count number of rows for each type
dt %>% count(type)

# # A tibble: 4 x 2
#    type     n
#   <chr> <int>
# 1     A     3
# 2     B     3
# 3     c     1
# 4     D     1

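# add cumulative percentages (sort = TRUE puts the most frequent types first,
# so the cumulative share runs from most to least frequent)
dt %>%
  count(type, sort = TRUE) %>%
  mutate(Prc = n/sum(n),
         CumPrc = cumsum(Prc))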

# # A tibble: 4 x 4
#    type     n   Prc CumPrc
#   <chr> <int> <dbl>  <dbl>
# 1     A     3 0.375  0.375
# 2     B     3 0.375  0.750
# 3     c     1 0.125  0.875
# 4     D     1 0.125  1.000

# pick the types you want to group together: everything past the 80% mark
# once types are ordered from most to least frequent
types_to_group <- dt %>%
  count(type, sort = TRUE) %>%
  mutate(Prc = n/sum(n),
         CumPrc = cumsum(Prc)) %>%
  filter(CumPrc > 0.80) %>%
  pull(type)

# group them
dt %>% mutate(type_upd = ifelse(type %in% types_to_group, "Rest", type))

#   type value type_upd
# 1    A     1        A
# 2    A     2        A
# 3    A     3        A
# 4    B     4        B
# 5    B     5        B
# 6    B     6        B
# 7    c     7     Rest
# 8    D     8     Rest
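
The steps above could also be wrapped in a small reusable function. This is only a sketch in base R under stated assumptions: the name `lump_tail` and its arguments `threshold` and `other` are hypothetical and not part of the answer. It orders values from most to least frequent before taking the cumulative share.

```r
# Hypothetical helper (not from the answer): replace the least frequent
# values of a vector with a single label once the cumulative share of rows,
# taken from most to least frequent, exceeds `threshold`.
lump_tail <- function(x, threshold = 0.80, other = "Rest") {
  x <- as.character(x)                        # also accepts factor input
  freq <- sort(table(x), decreasing = TRUE)   # most frequent first
  cum_share <- cumsum(as.vector(freq)) / sum(freq)
  tail_values <- names(freq)[cum_share > threshold]
  ifelse(x %in% tail_values, other, x)
}

type <- c("A", "A", "A", "B", "B", "B", "c", "D")
lump_tail(type)
# "A" "A" "A" "B" "B" "B" "Rest" "Rest"
```

As in the pipeline above, the value that crosses the threshold is itself grouped into `"Rest"`, so a single value covering more than 80% of the rows would be lumped too. Whether that is desirable depends on the data.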