On the Titanic dataset

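Time: 2016-09-07 11:43:31

Tags: r prediction

I have the famous Titanic dataset from Kaggle. I want to predict passenger survival with logistic regression, using the glm() function in R. I first split my data frame (891 rows in total) into two data frames: train (rows 1 to 800) and test (rows 801 to 891). The code is as follows: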

```r
data <- read.csv("train.csv", stringsAsFactors = FALSE)

names(data)
#  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"         "Age"         "SibSp"
#  [8] "Parch"       "Ticket"      "Fare"        "Cabin"       "Embarked"

# Replacing NA values in Age column with mean value of non NA values of Age.
data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)

# Converting sex into binary values. 1 for males and 0 for females.
sexcode <- ifelse(data$Sex == "male",1,0)

# dividing data into train and test data frames
train <- data[1:800,]
test <- data[801:891,]

# setting up the model using glm()
model <- glm(Survived~sexcode[1:800]+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))

# creating a data frame
newtest <- data.frame(sexcode[801:891],test$Age,test$Pclass,test$Fare)

prediction <- predict(model,newdata = newtest,type='response')
```

When I run the last line of code

  prediction <- predict(model,newdata = newtest,type='response')

I get the following error:

  Error in eval(expr, envir, enclos) : object 'Age' not found

Can anyone explain what the problem is? I checked the newtest variable and nothing seems wrong with it.

Here is the link to the Titanic dataset: https://www.kaggle.com/c/titanic/download/train.csv

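1 Answer:

Answer 0 (score: 2)

First, you should add sexcode directly to the data frame, so that it is split into train and test along with the other columns: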

data$sexcode <- ifelse(data$Sex == "male",1,0)

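Then, as I commented, the problem is with the column names in the newtest data frame, because you created it manually: its columns are not named Age, Pclass and Fare, so predict() cannot find the variables used in the model formula. You can use the test data frame directly instead.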
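To see the naming issue in isolation, here is a small sketch. The `toy` data frame and its values are made up for illustration and only stand in for the Titanic columns. When `data.frame()` is given unnamed expressions such as `toy$Age`, it derives the column names from those expressions, so no column ends up being called `Age`. Naming the columns explicitly is one way to keep a hand-built data frame compatible with the model formula.

```r
# Toy stand-in for the Titanic data (values are made up for illustration)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0),
  Age      = c(22, 38, 26, 35, 54, 2),
  Pclass   = c(3, 1, 3, 1, 1, 3),
  Fare     = c(7.25, 71.28, 7.92, 53.10, 51.86, 21.07)
)

# Built from unnamed expressions: the column names are derived from the
# expressions themselves, so none of them is literally "Age"
bad <- data.frame(toy$Age, toy$Pclass, toy$Fare)
print(names(bad))

# Built with explicit names: the columns match the names used in the formula
good <- data.frame(Age = toy$Age, Pclass = toy$Pclass, Fare = toy$Fare)
print(names(good))
```

Comparing `names(newtest)` with the variable names in the formula is a quick check to run before calling `predict()`.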

So here is your complete working code:

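  data <- read.csv("train.csv", stringsAsFactors = FALSE)
  # Replace missing ages with the mean of the observed ages
  data$Age[is.na(data$Age)] <- mean(data$Age, na.rm = TRUE)
  # Encode sex as a column of the data frame: 1 for males, 0 for females
  data$sexcode <- ifelse(data$Sex == "male",1,0)

  train <- data[1:800,]
  test <- data[801:891,]

  model <- glm(Survived~sexcode+Age+Pclass+Fare,family=binomial(link='logit'),data=train, control = list(maxit = 50))

  # test already has columns named sexcode, Age, Pclass and Fare
  prediction <- predict(model,newdata = test,type='response')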
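With `type='response'`, `predict()` returns a fitted probability for each test row. If you then want 0/1 survival predictions, one common approach is to threshold those probabilities. The sketch below uses simulated data, so it runs without train.csv. The `sim` data frame, its made-up survival mechanism and the 0.5 cutoff are all assumptions for illustration, not part of the original code.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the Titanic columns used in the model (not real data)
sim <- data.frame(
  sexcode = rbinom(n, 1, 0.6),
  Age     = round(runif(n, 1, 70)),
  Pclass  = sample(1:3, n, replace = TRUE),
  Fare    = round(runif(n, 5, 100), 2)
)

# Made-up survival mechanism, used only to generate a 0/1 outcome
p <- plogis(2 - 2.5 * sim$sexcode - 0.02 * sim$Age - 0.5 * sim$Pclass)
sim$Survived <- rbinom(n, 1, p)

train <- sim[1:160, ]
test  <- sim[161:200, ]

model <- glm(Survived ~ sexcode + Age + Pclass + Fare,
             family = binomial(link = "logit"), data = train)

# Fitted probabilities for the held-out rows
prediction <- predict(model, newdata = test, type = "response")

# Assumed cutoff of 0.5 to turn probabilities into class labels
predicted_class <- ifelse(prediction > 0.5, 1, 0)

# Share of held-out rows where the predicted class matches the outcome
accuracy <- mean(predicted_class == test$Survived)
print(accuracy)
```

The same last three steps should apply to the `prediction` vector from the Titanic code above, with `test$Survived` as the reference.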