Question

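I have a dataset with 13 variables and 100,000 observations.

A column named item_color describes the color of each item and has 85 levels. I want to regroup the rare colors so that I can reduce the number of levels. My threshold is 200: any color that occurs fewer than 200 times in the dataset should go into an "Other" group.

I know I can use length to find the counts. However, I can't work out the right logic for the code. I wrote this: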

order$item_color <-
  ifelse(length(order$item_color[order$item_color]) < 200, "Other", order$item_color)

But it replaced every color with "Other".

Answer 1

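As @lmo pointed out, you can use table. In the future, please provide example data.

There is no need for ifelse; you can use %in% to set every color whose count in the table is below 200 to "Other":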

# Create dummy data
set.seed(1)
item_color <- sample(c("red","blue","green"), 1000, replace = TRUE)
item_color[sample(1:1000,10)] <- "purple"
item_color[sample(1:1000,10)] <- "yellow"
item_color[sample(1:1000,10)] <- "orange"
order <- data.frame(item_color = item_color, stringsAsFactors = FALSE)

table(order$item_color)
#  blue  green orange purple    red yellow 
#   337    321     10     10    312     10 

# The actual solution
table_colors <- table(order$item_color)
order[order$item_color %in% names(table_colors)[table_colors < 200],"item_color"] <- "Other"

table(order$item_color)
# blue green Other   red 
#  337   321    30   312
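
The question mentions 85 levels, so item_color may well be stored as a factor rather than as character. In that case, assigning a value that is not an existing level produces NA with a warning, so the column has to go through character first. Below is a minimal sketch under that assumption; lump_rare() and its arguments min_n and other are hypothetical names chosen for illustration, not part of base R.

```r
# lump_rare() is a hypothetical helper, not a base R function.
# It replaces every value that occurs fewer than min_n times with `other`.
lump_rare <- function(x, min_n = 200, other = "Other") {
  was_factor <- is.factor(x)
  x_chr <- as.character(x)              # work on character to avoid invalid-level NAs
  counts <- table(x_chr)                # occurrences per distinct value
  rare <- names(counts)[counts < min_n] # values below the threshold
  x_chr[x_chr %in% rare] <- other
  # Return the same kind of vector that was passed in
  if (was_factor) factor(x_chr) else x_chr
}

# Small made-up example: two common colors and two rare ones
colors <- factor(c(rep("red", 300), rep("blue", 250), rep("teal", 5), rep("pink", 3)))
lumped <- lump_rare(colors, min_n = 200)
table(lumped)
```

Rebuilding the factor at the end also drops the now-unused rare levels, which is usually the point of regrouping.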

Edit: You named your data.frame order, but there is also a base function called order(). You should avoid names that mask existing functions.
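
For completeness, the ifelse idea from the question can also be made to work. The original attempt passes ifelse a single logical value (length() returns one number), so the result is a single value that gets recycled over the whole column. What ifelse needs is one count per row. A sketch of that, using ave() to compute the group size for each row; the data frame name orders and the dummy data are assumptions made for this example:

```r
# Dummy data (assumed for illustration): three common colors plus 10 purple rows
set.seed(1)
item_color <- sample(c("red", "blue", "green"), 1000, replace = TRUE)
item_color[1:10] <- "purple"
orders <- data.frame(item_color = item_color, stringsAsFactors = FALSE)

# ave() returns, for every row, the number of rows sharing that row's color
n_per_row <- ave(seq_along(orders$item_color), orders$item_color, FUN = length)

# Now the test is a vector with one element per row, so ifelse works element-wise
orders$item_color <- ifelse(n_per_row < 200, "Other", orders$item_color)
table(orders$item_color)
```

This assumes item_color is a character column; with a factor, ifelse would return the integer codes, so convert with as.character() first.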

Regrouping rare values of a factor based on their counts in R
