So, I have a data frame generated by the following block:
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult <- read.csv(url, strip.white = TRUE, header = FALSE)
colnames(adult) <- c("age", "workclass", "final weight", "education", "education-num", "marital-status",
                     "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week",
                     "native-country", "income")
The values in the "income" column are either "<=50k" or ">50k". When I try to select the people whose income is ">50k", I use the following command:
richs = adult[adult["income"] == ">50k",]
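However, the richs data frame is always empty. What am I doing wrong? Thanks.
Answer 0: (score: 0)
First, I downloaded the data into a data frame, with strings read as factors: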
> adults <- read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = FALSE)
> str(adults)
'data.frame': 32561 obs. of 15 variables:
$ V1 : int 39 50 38 53 28 37 49 52 31 42 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
$ V3 : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
$ V5 : int 13 13 9 7 13 14 5 9 14 13 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
$ V11: int 2174 0 0 0 0 0 0 0 14084 5178 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 40 13 40 40 40 40 16 45 50 40 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
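If you look at the data closely, you will notice that the feature you are working with is a factor with two levels: 1 = " <=50K" and 2 = " >50K". Note that the labels use an uppercase "K" (and, when read without strip.white = TRUE, a leading space), so a comparison against ">50k" matches no rows. A quick way to extract the samples in level 2 of this feature is to convert the factor to an integer and filter on it: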
> richadults = adults[as.integer(adults$V15) == 2, ]
> str(richadults)
'data.frame': 7841 obs. of 15 variables:
$ V1 : int 52 31 42 37 30 40 43 40 56 54 ...
$ V2 : Factor w/ 9 levels " ?"," Federal-gov",..: 7 5 5 5 8 5 7 5 3 1 ...
$ V3 : int 209642 45781 159449 280464 141297 121772 292175 193524 216851 180211 ...
$ V4 : Factor w/ 16 levels " 10th"," 11th",..: 12 13 10 16 10 9 13 11 10 16 ...
$ V5 : int 9 14 13 10 13 11 14 16 13 10 ...
$ V6 : Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 3 5 3 3 3 3 1 3 3 3 ...
$ V7 : Factor w/ 15 levels " ?"," Adm-clerical",..: 5 11 5 5 11 4 5 11 14 1 ...
$ V8 : Factor w/ 6 levels " Husband"," Not-in-family",..: 1 2 1 1 1 1 5 1 1 1 ...
$ V9 : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 2 2 5 5 5 2 ...
$ V10: Factor w/ 2 levels " Female"," Male": 2 1 2 2 2 2 1 2 2 2 ...
$ V11: int 0 14084 5178 0 0 0 0 0 0 0 ...
$ V12: int 0 0 0 0 0 0 0 0 0 0 ...
$ V13: int 45 50 40 80 40 40 45 60 40 60 ...
$ V14: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 20 1 40 40 40 36 ...
$ V15: Factor w/ 2 levels " <=50K"," >50K": 2 2 2 2 2 2 2 2 2 2 ...
The new data frame (richadults) contains only the people whose income is >50K: 7,841 samples. The original dataset has 32,561 samples.
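Filtering on the integer code depends on the order of the factor levels, so an alternative is to compare against the label itself. The following is a minimal sketch on a made-up toy data frame (adults_demo and rich_demo are hypothetical names, and the rows are invented for illustration). It assumes the labels look like those in the str() output above, with a leading space and an uppercase "K":

```r
# Toy stand-in for the real data: V15 mimics the income factor from str(),
# including the leading space and the uppercase "K". Values are made up.
adults_demo <- data.frame(
  V1  = c(39L, 50L, 52L, 31L),
  V15 = factor(c(" <=50K", " <=50K", " >50K", " >50K"))
)

# Comparing against the lowercase label matches no rows,
# which reproduces the empty result from the question.
empty_demo <- adults_demo[adults_demo$V15 == ">50k", ]
nrow(empty_demo)

# Trim surrounding whitespace and compare against the exact label instead.
rich_demo <- adults_demo[trimws(as.character(adults_demo$V15)) == ">50K", ]
nrow(rich_demo)
```

If the data were read with strip.white = TRUE, as in the question, the trimws() call should be unnecessary and comparing the column against ">50K" (uppercase K) would likely be enough.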