Grouping values with low cell counts

Asked: 2017-06-18 17:22:46

Tags: r

RStudio 3.4.0, 32-bit (on a 64-bit operating system), Windows 10

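I am working through a Kaggle kernel for the Titanic data set. The code runs without errors, but it also does not produce the expected result.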

 str(full)
'data.frame':   1309 obs. of  13 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
 $ Title      : chr  " Mr" " Mrs" " Miss" " Mrs" ...

Extracting the title from each passenger name:

full$Title <- gsub('(.*,)|(\\..*)','',full$Name)

# Show title counts by sex
table(full$Sex, full$Title)

# Titles with very low cell counts to be combined to "rare" level
rare_title <- c ('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don', 
                 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

# Also reassign mlle, ms, and mme accordingly
full$Title[full$Title == 'Mlle']        <- 'Miss' 
full$Title[full$Title == 'Ms']          <- 'Miss'
full$Title[full$Title == 'Mme']         <- 'Mrs' 
full$Title[full$Title %in% rare_title]  <- 'Rare Title'

# Show title counts by sex again
table(full$Sex, full$Title)

        Capt  Col  Don  Dona  Dr  Jonkheer  Lady  Major  Master  Miss  Mlle
  female     0    0    0     1   1         0     1      0       0   260     2
  male       1    4    1     0   7         1     0      2      61     0     0

          Mme  Mr  Mrs  Ms  Rev  Sir  the Countess
  female    1   0  197   2    0    0             1
  male      0 757    0   0    8    1             0

I cannot understand why the values are not being grouped into the rare level, even though I get no error. Why is this happening?

1 Answer:

Answer 0: (score: 1)

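The problem is that your titles have leading whitespace. As you can see in the output of str(full), the titles look like " Mr" rather than "Mr", so none of them match the strings you compare against.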
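As a minimal sketch of why this fails silently (the vectors below are made-up sample values, not the actual Titanic data): `==` and `%in%` compare strings exactly, so a value with a leading space never matches and R raises no error or warning.

```r
# Hypothetical sample values mimicking the Title column shown by str(full)
titles     <- c(" Mr", " Don", " Mlle")
rare_title <- c("Don", "Dr", "Major")

# Each value is one character longer than it looks because of the leading space
nchar(titles)            # 3 4 5

# Exact string comparison: nothing matches, and no error is raised
titles %in% rare_title   # FALSE FALSE FALSE
titles == "Mlle"         # FALSE FALSE FALSE
```

Because a logical index that is all FALSE selects nothing, the assignments in the question run without complaint and leave the column unchanged.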

You can fix this with trimws:
full <- data.frame(Title=c(" Mr", " Mrs", " Miss", " Major"," Don"),
                   age=1:5,stringsAsFactors = FALSE)
rare_title <- c ('Dona', 'Lady', 'the Countess','Capt', 'Col', 'Don'
                 ,'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')
full$Title[trimws(full$Title) %in% rare_title]  <- 'Rare Title'
full$Title

[1] " Mr"        " Mrs"       " Miss"      "Rare Title" "Rare Title"
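Note that this trims only inside the comparison, so values that are not replaced (such as " Mr") keep their leading space, and the earlier `== 'Mlle'` style comparisons would still fail. One possible alternative, sketched here on a few sample names from the question rather than the full data set, is to remove the space once at extraction time: either wrap the original gsub call in trimws, or include the space after the comma in the pattern. The variable names `name`, `title1`, and `title2` are only for illustration.

```r
# Sample names taken from the str(full) output in the question
name <- c("Braund, Mr. Owen Harris",
          "Heikkinen, Miss. Laina",
          "Futrelle, Mrs. Jacques Heath (Lily May Peel)")

# Option 1: keep the original pattern and trim the result once
title1 <- trimws(gsub('(.*,)|(\\..*)', '', name))

# Option 2: let the pattern consume the space that follows the comma
title2 <- gsub('(.*, )|(\\..*)', '', name)

title1                      # "Mr"   "Miss" "Mrs"
identical(title1, title2)   # TRUE
```

With either option applied to full$Title, the comparisons against 'Mlle', 'Ms', 'Mme', and rare_title in the question should then match as intended.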