I'm working through the Titanic machine learning dataset example on Kaggle and have run into the following problem. The error message reads:
Error in predict.randomForest(modelFit, newtest) :
Type of predictors in new data do not match that of the training data.
Here is my full code:
#Load the libraries:
library(ggplot2)
library(randomForest)
#Load the data:
set.seed(1)
train <- read.csv("train.csv")
test <- read.csv("test.csv")
gendermodel <- read.csv("gendermodel.csv")
genderclassmodel <- read.csv("genderclassmodel.csv")
#Preprocess the data and feature extraction:
features <- c("Pclass", "Age", "Sex", "Parch", "SibSp", "Fare", "Embarked")
newtrain <- train[,features]
newtest <- test[,features]
newtrain$Embarked[newtrain$Embarked==""] <- "S"
newtrain$Fare[newtrain$Fare == 0] <- median(newtrain$Fare, na.rm=TRUE)
newtrain$Age[is.na(newtrain$Age)] <- -1
newtest$Embarked[newtest$Embarked==""] <- "S"
newtest$Fare[newtest$Fare == 0] <- median(newtest$Fare, na.rm=TRUE)
newtest$Fare <- ifelse(is.na(newtest$Fare), mean(newtest$Fare, na.rm = TRUE), newtest$Fare)
newtest$Age[is.na(newtest$Age)] <- -1
#Model building
modelFit <- randomForest(newtrain, as.factor(train$Survived), ntree = 100, importance = TRUE)
predictedOutput <- data.frame(PassengerID = test$PassengerId)
predictedOutput$Survived <- predict(modelFit, newtest)
write.csv(predictedOutput, file = "TitanicPrediction.csv", row.names=FALSE)
MDA <- importance(modelFit, type=1)
featureImportance <- data.frame(Feature = row.names(MDA), Importance = MDA[,1])
#Plots
g <- ggplot(featureImportance, aes(x=Feature, y=Importance)) + geom_bar(stat="identity") + xlab("Feature") + ylab("Importance") + ggtitle("Feature importance")
ggsave("FeatureImportance.png", p)
I understand what the error message means, so I ran str(newtrain) and str(newtest). Even after the assignment newtrain$Embarked[newtrain$Embarked==""] <- "S", I get the following:
str(newtrain)
'data.frame': 891 obs. of 7 variables:
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Age : num 22 38 26 35 35 -1 54 2 27 14 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Embarked: Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
> length(which(train$Embarked == ""))
[1] 2
> length(which(newtrain$Embarked == ""))
[1] 0
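When I count the rows with a missing Embarked value in train and newtrain, I get the correct output shown above. I'm not sure where I went wrong. Any help is much appreciated. Thanks!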
Answer 0 (score: 0)
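After your line
newtrain$Embarked[newtrain$Embarked==""] <- "S"
do:
newtrain$Embarked <- factor(newtrain$Embarked)
This resets the factor levels of the modified newtrain$Embarked. Assigning "S" changes the values but leaves the empty "" level in place, as your str(newtrain) output shows.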
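To see why the empty level survives the assignment, here is a minimal sketch on a toy factor. The vector `embarked` and its values are made up for illustration and only stand in for the real Embarked column.

```r
# Toy stand-in for the Embarked column (hypothetical values, not the Kaggle data)
embarked <- factor(c("S", "C", "", "Q", "S"))

# Replace the empty entries, as in the question
embarked[embarked == ""] <- "S"

# No empty values remain...
n_empty <- sum(embarked == "")

# ...but "" is still registered as a level
levels_before <- levels(embarked)

# Rebuilding the factor drops levels that no longer occur
# (droplevels(embarked) should have the same effect here)
embarked <- factor(embarked)
levels_after <- levels(embarked)

print(n_empty)
print(levels_before)
print(levels_after)
```

If the level sets of the training and test columns still differ after this step, one option is to build the test factor with the training levels, for example factor(newtest$Embarked, levels = levels(newtrain$Embarked)).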
Also, in the last line of the code you posted, p should be g, since the plot is stored in g.
Good luck with Kaggle!