Kaggle Titanic: Machine Learning from Disaster, decision tree for Cabin prediction

Time: 2015-06-11 03:00:41

Tags: r machine-learning decision-tree kaggle

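One of the variables, 'Cabin', has a large number of NAs. I am trying to use a decision tree (rpart) to predict the cabin for those passengers whose cabin is not available.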

Currently, this is the structure of my data frame, which combines the training and test sets.

 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Pclass     : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
 $ Age        : num  22 38 26 35 35 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : Factor w/ 187 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
 $ Survived   : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
 $ FamilySize : num  2 2 1 2 1 1 1 5 3 2 ...
 $ FamilyID   : Factor w/ 8 levels "11","3","4","5",..: 8 8 8 8 8 8 8 4 2 8 ...
 $ FamilyID2  : Factor w/ 7 levels "11","4","5","6",..: 7 7 7 7 7 7 7 3 7 7 ...
 $ Title      : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
 $ Surname    : chr  "Braund" "Cumings" "Heikkinen" "Futrelle" ...
 $ Cabin2     : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...

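Note that I have used strsplit to create 'Cabin2', which extracts the letter from the 'Cabin' variable. As I understand it, that letter corresponds to the deck of the Titanic. This significantly reduces the number of levels I am fighting with, from 187 in 'Cabin' to 8 in 'Cabin2'.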
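The deck-letter extraction described above can be sketched as follows. This is a minimal illustration, not the code actually used: the `cabin` vector holds made-up example values, and `substr` is swapped in for `strsplit` because only the first character is needed.

```r
# Hypothetical example values; in the question the real column is combi$Cabin.
cabin <- c("C85", "", "E46", NA, "B57 B59 B63", "F G73")

# Treat empty strings as missing so they do not become a level of their own.
cabin[!is.na(cabin) & cabin == ""] <- NA

# Keep the first character of each entry (the deck letter); NA stays NA.
deck <- factor(substr(cabin, 1, 1))

print(deck)
print(nlevels(deck))
```

Entries listing several cabins, such as "B57 B59 B63", collapse to the deck of the first cabin, which is a simplification.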

I am trying to predict the cabin deck with the following code:

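cabinFit <- rpart(Cabin2 ~ Age + Sex + Fare + Embarked + SibSp + Parch + Title + FamilySize + FamilyID,
                  data = combi, method = "class")  # assumed completion: the call is cut off after the formula in the source

combi$Cabin2[is.na(combi$Cabin2)] <- predict(cabinFit, combi[is.na(combi$Cabin2), ])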

The output R throws at me is as follows:

 Warning messages:
 1: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L,   :
  invalid factor level, NA generated
 2: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L,   :
  number of items to replace is not a multiple of replacement length

I am desperately trying to understand this as I keep playing around with the data, but I am unclear why this code does not work for me.
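Both warnings say that the right-hand side of the assignment contains values that are not levels of 'Cabin2' and that its length differs from the number of missing rows. One thing worth checking, assuming the tree is fitted as a classification tree, is the shape of what `predict` returns: by default `predict` on an rpart classification tree gives a matrix with one probability column per level, while `type = "class"` gives one predicted level per row. The sketch below uses a made-up data frame `toy` and column `Deck` as stand-ins for `combi` and `Cabin2`; it illustrates that difference and is not a confirmed diagnosis of the warnings above.

```r
library(rpart)

# Hypothetical stand-in for combi: a small data frame with a partly missing deck factor.
set.seed(1)
n <- 200
toy <- data.frame(
  Fare = c(rnorm(n / 2, 80, 10), rnorm(n / 2, 10, 2)),
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE))
)
toy$Deck <- factor(ifelse(toy$Fare > 40, "B", "F"), levels = c("B", "F"))
toy$Deck[sample(n, 50)] <- NA

known <- !is.na(toy$Deck)

# Fit the classification tree only on rows where the target is known.
fit <- rpart(Deck ~ Fare + Sex, data = toy[known, ], method = "class")

# Default prediction: a matrix with one row per case and one column per level.
probs <- predict(fit, toy[!known, ])
print(dim(probs))

# type = "class": a factor with one predicted level per case, which fits the assignment.
pred <- predict(fit, toy[!known, ], type = "class")
toy$Deck[!known] <- pred

print(summary(toy$Deck))
```

Applied to the question's data, the same pattern would be `predict(cabinFit, combi[is.na(combi$Cabin2), ], type = "class")`.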

0 answers:

No answers