Question

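In previous versions of R, I could combine factor levels that did not meet a "significant" volume threshold using the following small function: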

whittle = function(data, cutoff_val){
  #convert to a data frame
  tab = as.data.frame.table(table(data))
  #returns vector of indices where value is below cutoff_val
  idx = which(tab$Freq < cutoff_val)
  levels(data)[idx] = "Other"
  return(data)
}

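This takes a factor vector, looks for levels that do not appear "often enough", and merges all of those levels into a single "Other" factor level. An example of the data is as follows: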

> sort(table(data$State))

   05    27    35    40    54    84     9    AP    AU    BE    BI    DI     G    GP    GU    GZ    HN    HR    JA    JM    KE    KU     L    LD    LI    MH    NA 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     1 
   OU     P    PL    RM    SR    TB    TP    TW     U    VD    VI    VS    WS     X    ZH    47    BL    BS    DL     M    MB    NB    RP    TU    11    DU    KA 
    1     1     1     1     1     1     1     1     1     1     1     1     1     1     1     2     2     2     2     2     2     2     2     2     3     3     3 
   BW    ND    NS    WY    AK    SD    13    QC    01    BC    MT    AB    HE    ID     J    NO    LN    NM    ON    NE    VT    UT    IA    MS    AO    AR    ME 
    4     4     4     4     5     5     6     6     7     7     7     8     8     8     9    10    11    17    23    26    26    30    31    31    38    40    44 
   OR    KS    HI    NV    WI    OK    KY    IN    WV    AL    CO    WA    MN    NH    MO    SC    LA    TN    AZ    IL    NC    MI    GA    OH    **    CT    DE 
   45    47    48    57    57    64   106   108   112   113   120   125   131   131   135   138   198   200   233   492   511   579   645   646   840   873  1432 
   RI    DC    TX    MA    FL    VA    MD    CA    NJ    PA    NY 
 1782  2513  6992  7027 10527 11016 11836 12221 15485 16359 34045

Now, however, when I use whittle it returns the following message:

> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other",  :
  duplicated levels in factors are deprecated

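How can I modify my function so that it has the same effect but does not rely on these "deprecated" duplicated factor levels? Should I convert to character, tabulate, and then replace the rare values with the string "Other"?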

Answer 1

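I always find it easiest (less typing and less headache) to convert to character and back for these kinds of operations. Keeping your as.data.frame.table and using replace to substitute the low-frequency levels: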

whittle <- function(data, cutoff_val) {
  tab = as.data.frame.table(table(data))
  factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}

Testing on some sample data:

state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD    MD    MD    Other Other
# Levels: MD Other
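
For readers who want to see the character round trip one step at a time, here is a minimal sketch on the same sample data. The intermediate names (counts, rare, as_chr, result) are introduced only for illustration and are not part of the answer above.

```r
state <- factor(c("MD", "MD", "MD", "VA", "TX"))

# Count how often each level occurs
counts <- table(state)

# Names of the levels that fall below the cutoff (here: fewer than 2)
rare <- names(counts)[counts < 2]

# Work on a plain character vector, so no factor-level bookkeeping is needed
as_chr <- as.character(state)
as_chr[as_chr %in% rare] <- "Other"

# Convert back; factor() rebuilds the level set from the values present
result <- factor(as_chr)
result
# [1] MD    MD    MD    Other Other
# Levels: MD Other
```

Because factor() rebuilds the levels from the remaining values, no duplicated levels are ever assigned.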

Answer 2

I think this version should work. The levels<- function allows you to collapse levels by assigning a named list (see ?levels).

whittle <- function(data, cutoff_val){
  tab <- table(data)
  shouldmerge <- tab < cutoff_val
  tokeep <- names(tab)[!shouldmerge]
  tomerge <- names(tab)[shouldmerge]
  nv <- c(as.list(setNames(tokeep,tokeep)), list("Other"=tomerge))
  levels(data)<-nv
  return(data)
}

And we test it with

set.seed(15)
x<-factor(c(sample(letters[1:10], 100, replace=T), sample(letters[11:13], 10, replace=T)))
table(x)
# x
#  a  b  c  d  e  f  g  h  i  j  k  l  m 
#  5 11  8  8  7  5 13 14 14 15  2  3  5 

y <- whittle(x, 9)
table(y)
# y
#     b     g     h     i     j Other 
#    11    13    14    14    15    43
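
The named-list form of levels<- may be easier to follow on a tiny input. The sketch below is only an illustration; the vector f and the new level name "A" are made up for the example. Each list name becomes a new level, and each list element gives the old levels that map to it.

```r
f <- factor(c("a", "b", "c", "a"))

# Names on the left are the new levels; values on the right are the old
# levels folded into them
levels(f) <- list(A = "a", Other = c("b", "c"))

f
# [1] A     Other Other A
# Levels: A Other
```

The order of the list also determines the order of the resulting levels, which is why "Other" ends up last in the whittle() version above.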

Answer 3

It is worth adding that the new forcats package contains the fct_lump() function, which is dedicated to this task.

Using @MrFlick's data:

x <- factor(c(sample(letters[1:10], 100, replace=T), 
              sample(letters[11:13], 10, replace=T)))

library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table

# b     g     h     i     j Other 
#11    13    14    14    15    43

The n argument specifies the number of most common values to keep.
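
Since the original whittle() works with a count cutoff rather than a number of levels to keep, a count-based variant may be a closer match. The sketch below assumes a forcats version that provides fct_lump_min() (0.5.0 or later, as far as I know) and reuses the small state example from Answer 1.

```r
library(forcats)

state <- factor(c("MD", "MD", "MD", "VA", "TX"))

# Lump every level that appears fewer than `min` times into "Other"
lumped <- fct_lump_min(state, min = 2)

table(lumped)
```

The label of the lumped level can be changed through the other_level argument if "Other" is already in use.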

Answer 4

Here is another way of doing it: replace all the items below the threshold with the first such level, then rename that level to "Other".

whittle <- function(x, thresh) {
  belowThresh <- names(which(table(x) < thresh))
  x[x %in% belowThresh] <- belowThresh[1]
  levels(x)[levels(x) == belowThresh[1]] <- "Other"
  factor(x)
}
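
One detail of this approach is that "Other" takes over the position of the first rare level instead of going to the end. The sketch below copies the function under the hypothetical name whittle_first so the example is self-contained, and shows one possible way to move "Other" last if that ordering matters.

```r
# Same logic as the function above, under a hypothetical name
whittle_first <- function(x, thresh) {
  belowThresh <- names(which(table(x) < thresh))
  x[x %in% belowThresh] <- belowThresh[1]
  levels(x)[levels(x) == belowThresh[1]] <- "Other"
  factor(x)
}

x <- factor(c("a", "b", "b", "c", "c", "c", "d"))
y <- whittle_first(x, 2)
levels(y)
# [1] "Other" "b"     "c"

# Move "Other" to the end; intersect() avoids adding an unused "Other"
# level when nothing fell below the threshold
lv <- c(setdiff(levels(y), "Other"), intersect(levels(y), "Other"))
y_last <- factor(y, levels = lv)
levels(y_last)
# [1] "b"     "c"     "Other"
```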

Combining factor levels in R 3.2.1
