I have the following dataset:
> str(train)
'data.frame': 4619 obs. of 110 variables:
$ UserID : int 1 2 5 6 7 8 9 11 12 13 ...
$ YOB : int 1938 1985 1963 1997 1996 1991 1995 1983 1984 1997 ...
$ Gender : Factor w/ 3 levels "","Female","Male": 3 2 3 3 3 2 3 3 2 2 ...
$ Income : Factor w/ 7 levels "","$100,001 - $150,000",..: 1 3 6 5 4 7 5 2 4 6 ...
$ HouseholdStatus: Factor w/ 7 levels "","Domestic Partners (no kids)",..: 5 6 5 6 6 6 6 5 5 6 ...
$ EducationLevel : Factor w/ 8 levels "","Associate's Degree",..: 1 8 1 7 4 5 4 3 7 4 ...
$ Party : Factor w/ 6 levels "","Democrat",..: 3 2 1 6 1 1 6 3 6 2 ...
$ Happy : int 1 1 0 1 1 1 1 1 0 0 ...
$ Q124742 : Factor w/ 3 levels "","No","Yes": 2 1 2 1 2 3 1 2 2 1 ...
$ Q124122 : Factor w/ 3 levels "","No","Yes": 1 3 3 3 2 3 1 3 3 1 ...
$ Q123464 : Factor w/ 3 levels "","No","Yes": 2 2 2 3 2 2 1 2 2 1 ...
$ Q123621 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 2 1 1 3 2 1 ...
$ Q122769 : Factor w/ 3 levels "","No","Yes": 2 2 2 1 3 1 1 2 2 2 ...
$ Q122770 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 1 1 2 3 3 ...
$ Q122771 : Factor w/ 3 levels "","Private","Public": 3 3 2 2 3 3 1 3 3 3 ...
$ Q122120 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 1 2 2 2 ...
$ Q121699 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 2 3 2 3 3 2 ...
$ Q121700 : Factor w/ 3 levels "","No","Yes": 2 3 2 2 3 3 2 2 2 2 ...
$ Q120978 : Factor w/ 3 levels "","No","Yes": 1 3 2 3 3 2 2 3 3 3 ...
$ Q121011 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 3 3 2 3 2 ...
$ Q120379 : Factor w/ 3 levels "","No","Yes": 2 3 3 2 3 3 2 2 2 3 ...
$ Q120650 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 3 ...
$ Q120472 : Factor w/ 3 levels "","Art","Science": 1 3 3 3 3 2 3 3 2 3 ...
$ Q120194 : Factor w/ 3 levels "","Study first",..: 3 2 3 2 2 3 3 3 3 3 ...
$ Q120012 : Factor w/ 3 levels "","No","Yes": 2 3 3 1 2 3 2 2 3 3 ...
$ Q120014 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 3 1 3 3 2 3 ...
$ Q119334 : Factor w/ 3 levels "","No","Yes": 1 3 2 2 2 3 2 3 2 2 ...
$ Q119851 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 2 2 3 2 2 3 ...
$ Q119650 : Factor w/ 3 levels "","Giving","Receiving": 1 2 2 3 2 1 2 2 2 3 ...
$ Q118892 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 2 1 3 2 2 ...
$ Q118117 : Factor w/ 3 levels "","No","Yes": 3 2 2 3 3 3 1 2 2 2 ...
$ Q118232 : Factor w/ 3 levels "","Idealist",..: 2 2 3 3 3 1 1 2 2 3 ...
$ Q118233 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 3 2 ...
$ Q118237 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 2 ...
$ Q117186 : Factor w/ 3 levels "","Cool headed",..: 1 2 2 2 1 3 1 2 3 1 ...
$ Q117193 : Factor w/ 3 levels "","Odd hours",..: 1 2 3 2 3 3 1 3 3 3 ...
$ Q116797 : Factor w/ 3 levels "","No","Yes": 3 3 2 2 2 1 1 2 2 1 ...
$ Q116881 : Factor w/ 3 levels "","Happy","Right": 2 2 3 3 2 2 1 2 2 1 ...
$ Q116953 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 1 3 3 3 3 1 ...
$ Q116601 : Factor w/ 3 levels "","No","Yes": 3 3 3 2 3 3 1 3 3 1 ...
$ Q116441 : Factor w/ 3 levels "","No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
$ Q116448 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 2 1 2 3 1 ...
$ Q116197 : Factor w/ 3 levels "","A.M.","P.M.": 3 2 2 2 2 3 1 2 3 1 ...
$ Q115602 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 1 3 2 1 ...
$ Q115777 : Factor w/ 3 levels "","End","Start": 3 2 3 3 3 3 1 3 2 1 ...
$ Q115610 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 1 1 3 2 1 ...
$ Q115611 : Factor w/ 3 levels "","No","Yes": 2 2 3 3 2 2 1 2 2 1 ...
$ Q115899 : Factor w/ 3 levels "","Circumstances",..: 2 3 3 2 2 3 1 2 3 1 ...
$ Q115390 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 1 2 3 3 2 1 ...
$ Q114961 : Factor w/ 3 levels "","No","Yes": 3 3 2 3 2 3 2 2 3 1 ...
$ Q114748 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 3 3 2 3 1 ...
$ Q115195 : Factor w/ 3 levels "","No","Yes": 3 3 3 3 3 2 3 3 3 1 ...
$ Q114517 : Factor w/ 3 levels "","No","Yes": 2 3 2 3 2 2 2 2 3 1 ...
$ Q114386 : Factor w/ 3 levels "","Mysterious",..: 1 3 3 2 2 3 3 3 3 1 ...
$ Q113992 : Factor w/ 3 levels "","No","Yes": 3 1 3 2 2 2 2 2 3 1 ...
$ Q114152 : Factor w/ 3 levels "","No","Yes": 3 2 2 2 3 2 2 2 2 1 ...
$ Q113583 : Factor w/ 3 levels "","Talk","Tunes": 2 3 2 3 3 3 3 2 3 1 ...
$ Q113584 : Factor w/ 3 levels "","People","Technology": 3 2 2 3 2 1 3 2 2 1 ...
$ Q113181 : Factor w/ 3 levels "","No","Yes": 2 3 3 3 2 3 3 2 2 1 ...
[list output truncated]
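As you can see, I have 110 variables. I am trying to build a predictive model that uses these variables to predict happiness. Instead of keeping them in factor form (as for a CART model, randomForest, and so on), I am trying to convert them to a vectorized or numeric type (to make life easier for the algorithms)...
Currently I am doing this one variable at a time: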
> table(train_new$Q117193)
Odd hours Standard hours
1410 1299 1910
> train_new$Q117193 = as.integer(train_new$Q117193)
> table(train_new$Q117193)
1 2 3
1410 1299 1910
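As you can see, almost all of the factor variables have missing values, represented by "". I have converted this dataset to numeric using the following: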
train_numeric$Gender = as.integer(train_numeric$Gender)
train_numeric[,grep(pattern="^Q1",colnames(train_numeric))] = lapply(train_numeric[,grep(pattern="^Q1",colnames(train_numeric))],as.integer)
I am using the mice package to impute this dataset... to be honest, I am lost. Any ideas on how to fill in these missing values?
Answer 0: (score: 0)
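You appear to be converting factor variables (such as Gender) to numeric format. As far as I know that is not possible in this case, because they contain strings, so I believe you can only convert them to character.
To replace all missing values ("") in the data frame train with NA, you can do something like assigning NA to every entry of train that equals "".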
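A minimal sketch of that replacement, assuming the empty strings only occur in factor columns, as in the str() output in the question. The toy data frame and the helper name blank_to_na are made up for illustration; only the column names come from the question.

```r
# Toy stand-in for `train`; the values are invented for illustration.
train <- data.frame(
  YOB     = c(1938L, 1985L, 1963L, 1997L),
  Gender  = factor(c("Male", "Female", "", "Male")),
  Q124742 = factor(c("No", "", "No", "Yes"))
)

# Hypothetical helper: turn the "" level of a factor into NA.
# Assigning NA to a level drops that level and marks its entries as missing.
blank_to_na <- function(x) {
  if (is.factor(x)) {
    levels(x)[levels(x) == ""] <- NA
  }
  x
}

# train[] keeps the data frame structure while replacing every column.
train[] <- lapply(train, blank_to_na)

levels(train$Gender)       # "" is no longer a level
colSums(is.na(train))      # number of missing values per column
as.integer(train$Gender)   # level codes; missing entries stay NA
```

Note that as.integer() on a factor returns its internal level codes rather than failing, which matches the 1, 2, 3 table in the question. Once "" is no longer a level, the blanks come out as NA instead of being coded as 1, which is likely what an imputation package expects.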
Answer 1: (score: 0)
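You can fix this when importing the file. I assume you imported a CSV file, so the code would be
dataset <- read.csv(file = "file location", sep = ",", header = TRUE, na.strings = c("", "NA"))
It will replace the blanks in your categorical variables with NA.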
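To see the effect without a file on disk, here is a small sketch that feeds read.csv an inline string through its text argument. The CSV rows are invented, and stringsAsFactors = TRUE is an assumption added so the columns come back as factors like those in the question (since R 4.0.0 the default is FALSE).

```r
# Inline CSV standing in for the real file; the rows are invented.
csv_text <- "UserID,Gender,Q124742
1,Male,No
2,Female,
5,,No"

# text= is used only so the example runs without a file on disk.
dataset <- read.csv(text = csv_text, header = TRUE,
                    na.strings = c("", "NA"),
                    stringsAsFactors = TRUE)

str(dataset)              # factor columns no longer have a "" level
colSums(is.na(dataset))   # blanks are now counted as missing values
```

Because the blanks are never read in as a level, no clean-up pass is needed afterwards, and the resulting data frame can be handed to an imputation step directly.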