Why do I get an extra category when trying to group factor levels?

Asked: 2019-02-17 12:19:36

Tags: r

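I am trying to use gsub to group some of the workclass categories in the "Adult Income" dataset based on their existing values. However, I end up with two versions of the "Other-Unknown" category. Can someone help me understand why? There are no NAs/nulls; missing values appear as ? in the fields. Thanks in advance!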

Here is my code:

total_data <- read.csv("adult_data_set.csv")

levels(total_data$workclass)[1] <- "Unknown"
total_data$workclass <- gsub("Federal-gov", "Public Sector",total_data$workclass)
total_data$workclass <- gsub("Local-gov", "Public Sector", total_data$workclass)
total_data$workclass <- gsub("State-gov", "Public Sector", total_data$workclass)
total_data$workclass <- gsub("Self-emp-inc", "Self Employed", total_data$workclass)
total_data$workclass <- gsub("Self-emp-not-inc", "Self Employed", total_data$workclass) 
total_data$workclass <- gsub("Never-worked", "Other-Unknown", total_data$workclass) #this is part of the 17 count
total_data$workclass <- gsub("Without-pay", "Other-Unknown", total_data$workclass) #this is part of the 17 count
total_data$workclass <- gsub("^Unknown", "Other-Unknown", total_data$workclass)

total_data$workclass <- as.factor(total_data$workclass)

This is the result I get:

 Other-Unknown        Private  Public Sector  Self Employed  Other-Unknown 
            17          22333           4335           3716           1859 

I expected:

 Other-Unknown        Private  Public Sector  Self Employed   
            1876          22333           4335           3716                

2 answers:

Answer 0 (score: 3)

It looks like the fields in your CSV file are separated by a comma followed by a space.

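So your Other-Unknown levels are not actually identical: some are "Other-Unknown" and others are " Other-Unknown" with a leading space. The level you renamed through levels() lost its space, while gsub only replaced the matched substring and left the space in place.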
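A minimal sketch of how the two variants can arise, using three made-up values rather than the real dataset. The factor `f` and its contents are assumptions for illustration only.

```r
# Toy values mimicking fields read from a ", "-separated CSV (assumed data)
f <- factor(c(" Never-worked", " ?", " Private"))

# Renaming a level replaces the whole string, so the leading space is lost
levels(f)[levels(f) == " ?"] <- "Unknown"

# gsub replaces only the matched substring, so the leading space survives
x <- as.character(f)
x <- gsub("Never-worked", "Other-Unknown", x)
x <- gsub("^Unknown", "Other-Unknown", x)

# dput() quotes the strings, which makes the stray space visible
lv <- levels(factor(x))
dput(lv)
```

Printing with `dput()` or `print()` rather than reading a `table()` header is what exposes the difference, since the header aligns names with padding.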

In that case you can add the strip.white=TRUE option to your read.csv call, which removes whitespace at the beginning and end of each field.
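The effect of strip.white can be checked on a small inline sample before re-reading the full file. This is only a sketch: the two-row `csv_text` is hypothetical, and `stringsAsFactors = TRUE` is set explicitly because R 4.0 and later no longer convert strings to factors by default.

```r
# Hypothetical two-row sample in the ", "-separated layout described above
csv_text <- "age,workclass\n39, State-gov\n50, ?"

# Without strip.white the values keep their leading space
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# With strip.white = TRUE the surrounding whitespace is removed on read
clean <- read.csv(text = csv_text, stringsAsFactors = TRUE, strip.white = TRUE)

dput(levels(raw$workclass))
dput(levels(clean$workclass))
```

If the data is already loaded, applying `trimws()` to the column should have a similar effect.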

Answer 1 (score: 0)

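See whether the code below does what you need. It uses the function fct_collapse from the forcats package to collapse factor levels.
It is untested, since the question includes no sample dataset.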

library(forcats)

all_workclass <- levels(total_data$workclass)
public <- c("Federal-gov", "Local-gov", "State-gov")
selfemp <- c("Self-emp-inc", "Self-emp-not-inc")
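# Keep "Private" as its own level; only the remaining levels become Other-Unknown.
# Assumes the levels carry no leading whitespace (e.g. read with strip.white = TRUE).
other <- setdiff(all_workclass, c(public, selfemp, "Private"))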

total_data$workclass <- fct_collapse(total_data$workclass,
  'Public Sector' = public,
  'Self Employed' = selfemp,
  'Other-Unknown' = other
)
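For comparison, a similar collapse can be sketched in base R by assigning a named list to `levels()`. This is an alternative technique, not the answer's method, and the toy factor below is an assumption standing in for `total_data$workclass` with whitespace already stripped.

```r
# Toy factor standing in for total_data$workclass (assumed, already trimmed)
workclass <- factor(c("Federal-gov", "Private", "Self-emp-inc",
                      "?", "Local-gov", "Without-pay"))

# Each list name is a new level; its value lists the old levels merged into it.
# Old levels missing from the list would become NA, so "Private" is listed too.
levels(workclass) <- list(
  "Public Sector" = c("Federal-gov", "Local-gov", "State-gov"),
  "Self Employed" = c("Self-emp-inc", "Self-emp-not-inc"),
  "Other-Unknown" = c("?", "Never-worked", "Without-pay"),
  "Private"       = "Private"
)

counts <- table(workclass)
print(counts)
```

Unlike the gsub chain in the question, this matches whole level names, so a stray leading space would show up as NA values instead of as a silent duplicate level.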