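One of the variables, 'Cabin', has a large number of NAs. I am trying to use a decision tree (rpart) to predict the cabin for those passengers whose cabin is not available.
Currently, this is the structure of my data frame, which combines the training and test sets.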
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 187 levels "","A10","A14",..: NA 83 NA 57 NA NA 131 NA NA NA ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
$ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
$ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
$ FamilyID : Factor w/ 8 levels "11","3","4","5",..: 8 8 8 8 8 8 8 4 2 8 ...
$ FamilyID2 : Factor w/ 7 levels "11","4","5","6",..: 7 7 7 7 7 7 7 3 7 7 ...
$ Title : Factor w/ 11 levels "Col","Dr","Lady",..: 7 8 5 8 7 7 7 4 8 8 ...
$ Surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ Cabin2 : Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
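Please note that I have used strsplit to create 'Cabin2', which holds the letter extracted from the 'Cabin' variable; as I understand it, this letter corresponds to the deck of the Titanic. This significantly reduces the number of levels I have to deal with, from 187 in 'Cabin' to 8 in 'Cabin2'.
I am trying to predict the cabin deck using the following code: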
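For reference, a minimal sketch of how such a deck letter can be extracted with strsplit. The sample cabin values and the names `cabin`, `deck_chr` and `cabin2` are hypothetical and only illustrate the idea; this is not necessarily the exact code used to build 'Cabin2'.

```r
# Hypothetical sample of Cabin values; "" and NA both stand for an unknown cabin
cabin <- factor(c("C85", NA, "C123", "", "E46", "B57 B59 B63", "F G73"))

# Work on character data and treat empty strings as missing
cabin_chr <- as.character(cabin)
cabin_chr[!is.na(cabin_chr) & cabin_chr == ""] <- NA

# Split each value into single characters and keep the first one (the deck letter);
# an NA input stays NA
deck_chr <- vapply(strsplit(cabin_chr, ""), function(x) x[1],
                   character(1), USE.NAMES = FALSE)

# Store the result as a factor, so only the observed deck letters become levels
cabin2 <- factor(deck_chr)
print(table(cabin2, useNA = "ifany"))
```

Treating the empty string as NA before the split keeps "" from lingering as a separate level.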
cabinFit <- rpart(Cabin2 ~ Age + Sex + Fare + Embarked + SibSp + Parch + Title + FamilySize + FamilyID,
combi$Cabin2[is.na(combi$Cabin2)] <- predict(cabinFit, combi[is.na(combi$Cabin2),])
The output that R throws at me is as follows:
Warning messages:
1: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L, :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, is.na(combi$Cabin2), value = c(NA, 3L, :
number of items to replace is not a multiple of replacement length
I am desperately trying to understand this as I continue to fiddle with the data, but it is not clear to me why this code does not work.
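One likely reading of these warnings, offered as a sketch and not a confirmed diagnosis: for a classification tree, predict() on an rpart fit returns a matrix of class probabilities by default (one column per factor level), not a vector of labels. Such a matrix holds more values than there are NA slots, and its entries are not valid factor levels. Passing type = "class" returns a factor with one label per row instead. The example below uses the built-in iris data as a stand-in for the Titanic table (an assumption made only to keep it self-contained), with Species playing the role of Cabin2.

```r
library(rpart)

# Stand-in data (assumption): blank out every 5th label to mimic the missing decks
df <- iris
df$Species[seq(5, nrow(df), by = 5)] <- NA
miss <- is.na(df$Species)

# Fit a classification tree on the rows whose label is known
fit <- rpart(Species ~ ., data = df[!miss, ], method = "class")

# Default prediction for a classification tree: a probability matrix,
# one row per case and one column per factor level
probs <- predict(fit, df[miss, ])
print(dim(probs))

# type = "class" returns a factor with exactly one predicted label per row
pred <- predict(fit, df[miss, ], type = "class")

# The factor has the same length as the NA slots, so the assignment is clean
df$Species[miss] <- pred
print(table(df$Species, useNA = "ifany"))
```

Applied to the code in the question, the equivalent change would be adding type = "class" to the predict() call, assuming the tree was fitted as a classification tree.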