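I'm evaluating subsetting in R (using RStudio). I have a public dataset on which the subset function doesn't seem to work.
The dataset is the Adult dataset: http://archive.ics.uci.edu/ml/datasets/Adult
I create the data frame as follows: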
adult <- read.csv("~/Resources/Test ML Data/Adult/adult.data", header=FALSE)
colnames(adult) <- c("age","workclass","final weight","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","income")
adult[["final weight"]] <- NULL
adult[["education-num"]] <- NULL
adult[["age"]] <- ordered(cut(adult[["age"]],c(15,25,45,65,100)),labels = c("Young","Middle-Aged","Senior","Old"))
adult[["hours-per-week"]] <- ordered(cut(adult[["hours-per-week"]],c(0,25,40,60,168)),labels = c("Part-Time","Full-Time","Over-Time","Workaholic"))
adult[["capital-gain"]] <- ordered(cut(adult[["capital-gain"]],c(-Inf,0,median(adult[["capital-gain"]][adult[["capital-gain"]]>0]),Inf)),labels = c("None","Low","High"))
adult[["capital-loss"]] <- ordered(cut(adult[["capital-loss"]],c(-Inf,0,median(adult[["capital-loss"]][adult[["capital-loss"]]>0]),Inf)),labels = c("None","Low","High"))
Then I try to subset the data on any column:
adult_t <- adult[adult["sex"] != "Female", ]
The resulting adult_t data frame is identical to the original data frame. I have also tried several variations:
adult_t <- subset(adult,adult$sex != "Female")
Same result.
I can subset other datasets, such as the Wine dataset (also hosted on the same site):
wine <- read.csv("~/Resources/Test ML Data/Wine/wine.data", header=FALSE)
colnames(wine) <- c("class","Alcohol","Malic Acid","Ash","Alcalinity of Ash","Magnesium","Total Phenols","Flavanoids","Nonflavanoid Phenols","Proanthocyanins","Color Intensity","Hue","0D280/OD315 of Diluted Wines","Proline")
wine_t <- wine[wine["Magnesium"] > 100, ]
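This works as expected.
I can't understand why the Adult dataset isn't being subset. I'm new to R, so any insight into what is happening here would help.
I'm using RStudio version 0.98.981 and R version 3.1.1.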
Answer 0 (score: 2)
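The problem with this dataset is the space after each comma, which is non-standard for CSV. Each text field is therefore read with a leading space (for example " Female" rather than "Female"), so the comparison != "Female" is TRUE for every row and nothing is filtered out. You can fix this with the strip.white argument of read.csv: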
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult <- read.csv(url, strip.white = TRUE, header = FALSE)
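To see the effect without downloading the file, here is a minimal sketch. It reads a few made-up rows that imitate the ", " separators in adult.data; the sample rows and the names csv_text, raw, and clean are assumptions for illustration only, not part of the original dataset or question.

```r
# Made-up sample rows that imitate the comma-plus-space separators in adult.data.
csv_text <- "39, State-gov, Male\n50, Self-emp-not-inc, Male\n28, Private, Female\n37, Private, Female"

# Default read: the space after each comma stays inside the string values.
raw <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
colnames(raw) <- c("age", "workclass", "sex")
print(unique(raw$sex))                    # values carry a leading space
print(nrow(raw[raw$sex != "Female", ]))   # no row equals "Female", so all rows are kept

# strip.white = TRUE trims the padding, so the same filter now drops the Female rows.
clean <- read.csv(text = csv_text, header = FALSE, strip.white = TRUE,
                  stringsAsFactors = FALSE)
colnames(clean) <- c("age", "workclass", "sex")
print(nrow(clean[clean$sex != "Female", ]))
```

Numeric columns such as Magnesium in the Wine example are not affected by surrounding whitespace, which would explain why that subset worked.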