Factor levels remain unchanged even after removing a level

Date: 2016-08-05 14:46:29

Tags: r

I'm working through the Titanic machine learning dataset example on Kaggle and ran into the following problem. The error message reads:

Error in predict.randomForest(modelFit, newtest) : 
Type of predictors in new data do not match that of the training data.

Here is my full code:

#Load the libraries:
library(ggplot2)
library(randomForest)

#Load the data:
set.seed(1)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
gendermodel <- read.csv("gendermodel.csv")
genderclassmodel <- read.csv("genderclassmodel.csv")

#Preprocess the data and feature extraction:
features <- c("Pclass", "Age", "Sex", "Parch", "SibSp", "Fare", "Embarked")                  

newtrain <- train[,features]
newtest <- test[,features]

newtrain$Embarked[newtrain$Embarked==""] <- "S"
newtrain$Fare[newtrain$Fare == 0] <- median(newtrain$Fare, na.rm=TRUE)
newtrain$Age[is.na(newtrain$Age)] <- -1

newtest$Embarked[newtest$Embarked==""] <- "S"
newtest$Fare[newtest$Fare == 0] <- median(newtest$Fare, na.rm=TRUE)
newtest$Fare <- ifelse(is.na(newtest$Fare), mean(newtest$Fare, na.rm = TRUE), newtest$Fare)
newtest$Age[is.na(newtest$Age)] <- -1

#Model building

modelFit <- randomForest(newtrain, as.factor(train$Survived), ntree = 100, importance = TRUE)
predictedOutput <- data.frame(PassengerID = test$PassengerId)
predictedOutput$Survived <- predict(modelFit, newtest)
write.csv(predictedOutput, file = "TitanicPrediction.csv", row.names=FALSE)

MDA <- importance(modelFit, type=1)
featureImportance <- data.frame(Feature = row.names(MDA), Importance = MDA[,1])

#Plots
g <- ggplot(featureImportance, aes(x=Feature, y=Importance)) + geom_bar(stat="identity") + xlab("Feature") + ylab("Importance") + ggtitle("Feature importance")
ggsave("FeatureImportance.png", p)

I understand what the error message means, so I ran str(newtrain) and str(newtest). Even after assigning newtrain$Embarked[newtrain$Embarked==""] <- "S", I still get the following:

str(newtrain)
'data.frame':   891 obs. of  7 variables:
 $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Age     : num  22 38 26 35 35 -1 54 2 27 14 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Parch   : int  0 0 0 0 0 0 0 1 2 0 ...
 $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
> length(which(train$Embarked == ""))
[1] 2
> length(which(newtrain$Embarked == ""))
[1] 0

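When I count the rows with missing values in the train and newtrain datasets, I get the correct output shown above. I'm not sure where I'm going wrong. Any help is much appreciated. Thanks!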

1 answer:

Answer 0: (score: 0)

After your line

newtrain$Embarked[newtrain$Embarked==""] <- "S"

do:

newtrain$Embarked <- factor(newtrain$Embarked)

This resets the levels of the modified factor, dropping the now-unused "" level.

Also, in the last line of the code you posted, p should be g.
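To see why reassigning values alone is not enough, here is a minimal sketch on a toy vector. The names embarked and test_embarked are made up for illustration and stand in for newtrain$Embarked and newtest$Embarked. The last step, giving the test column the same level set as the training column, is an assumption about what predict() needs here, not something confirmed against the full dataset.

```r
# Toy stand-in for train$Embarked: two empty strings among real ports.
embarked <- factor(c("S", "C", "", "Q", "S", ""))
print(levels(embarked))   # four levels, including ""

# Overwriting the values leaves the now-unused "" level in place.
embarked[embarked == ""] <- "S"
print(levels(embarked))   # still four levels
print(table(embarked))    # the "" level now has a count of 0

# Rebuilding the factor discards unused levels; droplevels(embarked)
# has the same effect.
embarked <- factor(embarked)
print(levels(embarked))

# Hypothetical test column, built with the training levels so that
# both columns carry an identical level set.
test_embarked <- factor(c("Q", "S", "C"), levels = levels(embarked))
print(identical(levels(test_embarked), levels(embarked)))
```

The same pattern applies to the real data: rebuild the factor in both newtrain and newtest after cleaning, then compare levels() on the two columns before calling predict().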

Good luck with Kaggle!