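I'm trying to use gsub to group some of the workclass categories in the Adult Income data set based on their existing values. However, I end up with two versions of the "Other-Unknown" category. Can someone help me understand why? NAs/nulls appear as ? in the field. Thanks in advance!
Here is my code: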
total_data <- read.csv("adult_data_set.csv")
levels(total_data$workclass)[1] <- "Unknown"
total_data$workclass <- gsub("Federal-gov", "Public Sector",total_data$workclass)
total_data$workclass <- gsub("Local-gov", "Public Sector", total_data$workclass)
total_data$workclass <- gsub("State-gov", "Public Sector", total_data$workclass)
total_data$workclass <- gsub("Self-emp-inc", "Self Employed", total_data$workclass)
total_data$workclass <- gsub("Self-emp-not-inc", "Self Employed", total_data$workclass)
total_data$workclass <- gsub("Never-worked", "Other-Unknown", total_data$workclass) #this is part of the 17 count
total_data$workclass <- gsub("Without-pay", "Other-Unknown", total_data$workclass) #this is part of the 17 count
total_data$workclass <- gsub("^Unknown", "Other-Unknown", total_data$workclass)
total_data$workclass <- as.factor(total_data$workclass)
This is the result I get:
Other-Unknown Private Public Sector Self Employed Other-Unknown
           17   22333          4335          3716          1859
I was expecting:
Other-Unknown Private Public Sector Self Employed
         1876   22333          4335          3716
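Answer 0 (score: 3)
It looks like the fields in your CSV file are separated by a comma followed by a space, so every value is read with a leading space. As a result your two Other-Unknown levels are not identical: one is "Other-Unknown" and the other is " Other-Unknown" with a leading space. gsub replaces only the matched text, so " Never-worked" and " Without-pay" keep their leading space, while the level you renamed to "Unknown" by hand has none.
In this case you can add the strip.white=TRUE option to the read.csv call, which removes whitespace at the start and end of each field.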
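As a small illustration of the point above, the following sketch reads a made-up three-row sample with and without strip.white = TRUE. The sample text and the names csv_text, raw, clean and with_space are assumptions for demonstration only; they are not taken from the real adult_data_set.csv.

```r
# Hypothetical sample, not the real adult data set: values are separated
# by a comma followed by a space, as suspected in the question.
csv_text <- "age,workclass\n39, State-gov\n50, ?\n38, Never-worked"

raw   <- read.csv(text = csv_text, stringsAsFactors = TRUE)
clean <- read.csv(text = csv_text, stringsAsFactors = TRUE, strip.white = TRUE)

levels(raw$workclass)    # each level keeps its leading space
levels(clean$workclass)  # surrounding whitespace has been removed

# gsub() replaces only the matched substring, so the leading space survives
# and produces a label that differs from "Other-Unknown".
with_space <- gsub("Never-worked", "Other-Unknown", " Never-worked")
identical(with_space, "Other-Unknown")  # FALSE
```

If re-reading the file is not convenient, applying trimws() to the column before the gsub calls should have a similar effect.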
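Answer 1 (score: 0)
See whether the code below does what you need. It uses the fct_collapse function from the forcats package to collapse factor levels.
This is untested, because the question does not include a sample data set.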
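library(forcats)
# Assumes workclass is a factor whose levels carry no leading whitespace
# (for example, the file was read with strip.white = TRUE).
all_workclass <- levels(total_data$workclass)
public <- c("Federal-gov", "Local-gov", "State-gov")
selfemp <- c("Self-emp-inc", "Self-emp-not-inc")
# Keep "Private" as its own level, as in the expected output.
other <- setdiff(all_workclass, c(public, selfemp, "Private"))
total_data$workclass <- fct_collapse(total_data$workclass,
  "Public Sector" = public,
  "Self Employed" = selfemp,
  "Other-Unknown" = other
)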
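For readers who would rather not add a package, a similar collapse can be sketched in base R by assigning a named list to levels(). This is an alternative technique, not the forcats approach above. The toy vector workclass is hypothetical and stands in for total_data$workclass after whitespace has been stripped.

```r
# Hypothetical toy factor; the values are illustrative only.
workclass <- factor(c("?", "Federal-gov", "Local-gov", "Private",
                      "Self-emp-inc", "Never-worked", "Without-pay", "Private"))

# Each list name is a new level; each value lists the old levels it absorbs.
# An old level left out of the list would become NA, so every level is listed.
levels(workclass) <- list(
  "Other-Unknown" = c("?", "Never-worked", "Without-pay"),
  "Private"       = "Private",
  "Public Sector" = c("Federal-gov", "Local-gov"),
  "Self Employed" = "Self-emp-inc"
)

table(workclass)
```

The new levels appear in the order given in the list, which makes it easy to control how the table is printed.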