Trying to use separate to split one column into more than two columns

Time: 2014-10-06 20:53:02

Tags: r dplyr tidyr kaggle

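I am new to R and am practicing with Kaggle's Titanic dataset. I am trying to split the last name, first name, salutation, and extra information into separate columns so that I can try to classify each passenger by age as an adult or a child.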

Here is some sample data from the Train dataset:

head(traindf,5)
# Source: local data frame [5 x 12]
# 
# PassengerId Survived Pclass
# 1           1        0      3
# 2           2        1      1
# 3           3        1      3
# 4           4        1      1
# 5           5        0      3
# Variables not shown: Name (chr), Sex (fctr), Age (dbl), SibSp (int), Parch
# (int), Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)

Here is a sample that includes the names:

select(traindf,Survived,Pclass,Name,Sex)
# Source: local data frame [891 x 4]
# 
# Survived Pclass                                                Name    Sex
# 1         0      3                             Braund, Mr. Owen Harris   male
# 2         1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
# 3         1      3                              Heikkinen, Miss. Laina female
# 4         1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female
# 5         0      3                            Allen, Mr. William Henry   male
# 6         0      3                                    Moran, Mr. James   male
# 7         0      1                             McCarthy, Mr. Timothy J   male
# 8         0      3                      Palsson, Master. Gosta Leonard   male
# 9         1      3   Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female
# 10        1      2                 Nasser, Mrs. Nicholas (Adele Achem) female

I can separate the last name from the rest of the column with the following code:

require(tidyr) # for the separate() function

traindfnames <- traindf %>%
  separate(Name, c("Lastname","Salutation"), sep = ",")

traindfnames 
# Source: local data frame [891 x 13]
# 
# PassengerId Survived Pclass  Lastname
# 1            1        0      3    Braund
# 2            2        1      1   Cumings
# 3            3        1      3 Heikkinen
# 4            4        1      1  Futrelle
# 5            5        0      3     Allen
# 6            6        0      3     Moran
# 7            7        0      1  McCarthy
# 8            8        0      3   Palsson
# 9            9        1      3   Johnson
# 10          10        1      2    Nasser
# ..         ...      ...    ...       ...
# Variables not shown: Salutation (chr), Sex (fctr), Age (dbl), SibSp (int),
# Parch (int), Ticket (fctr), Fare (dbl), Cabin (fctr), Embarked (fctr)

However, when I try to add a field for the first name:

traindfnames <- traindf %>%
separate(Name, c("Lastname","Salutation","firstname"), sep =",,")

I get this error:

# Error: Values not split into 3 pieces at 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 2

Am I using incorrect syntax, or is it not possible to get 3 fields out of one column?

1 answer:

Answer 0: (score: 1)

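Having looked at the data, I think the simplest approach is to use str_match() from the stringr package. If you assume that data$Name has the form "[Lastname], [Salutation]. [Firstname]", then a regular expression that matches it is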

str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")
#      [,1]                                                  [,2]        [,3]   [,4]                                   
# [1,] "Braund, Mr. Owen Harris"                             "Braund"    "Mr"   "Owen Harris"                          
# [2,] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Cumings"   "Mrs"  "John Bradley (Florence Briggs Thayer)"
# [3,] "Heikkinen, Miss. Laina"                              "Heikkinen" "Miss" "Laina"                                
# [4,] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"        "Futrelle"  "Mrs"  "Jacques Heath (Lily May Peel)"        
# [5,] "Allen, Mr. William Henry"                            "Allen"     "Mr"   "William Henry"                        
# [6,] "Moran, Mr. James"                                    "Moran"     "Mr"   "James" 

So you need to add columns 2 through 4 of the matrix above to the original data frame. I'm not sure you can actually do this with separate. Writing

separate(data, Name, c("Lastname", "Salutation", "Firstname"), sep = "[,\\.]") 

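will try to split each entry at every comma or dot, but it runs into trouble at the 514th entry, which looks like "Rothschild, Mrs. Martin (Elizabeth L. Barrett)" (note the second dot).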
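To make the piece counts concrete, here is a minimal sketch on a hypothetical three-row sample (`sample_df` and its rows are made up for illustration, in the same layout as the Titanic names). It counts pieces with base R's strsplit(), using "[,.]" (inside a character class the dot is already literal). It then shows one possible workaround, assuming a tidyr version recent enough to offer the `extra` argument of separate(); this is an alternative sketch, not the approach the answer recommends.

```r
library(tidyr)

# Hypothetical sample in the "Lastname, Salutation. Firstname" layout
sample_df <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Heikkinen, Miss. Laina",
           "Rothschild, Mrs. Martin (Elizabeth L. Barrett)"),
  stringsAsFactors = FALSE
)

# ",," only matches two consecutive commas, which never occur in these
# names, so every value stays in a single piece
pieces_double_comma <- lengths(strsplit(sample_df$Name, ",,"))
print(pieces_double_comma)   # 1 1 1

# "[,.]" splits at every comma or dot, so the extra dot in the third
# name yields a fourth piece
pieces_comma_or_dot <- lengths(strsplit(sample_df$Name, "[,.]"))
print(pieces_comma_or_dot)   # 3 3 4

# Assumption: this tidyr version supports extra = "merge", which splits
# at most length(into) times and leaves later separators in the last piece.
# "\\s*" also swallows the space that follows each separator.
merged <- separate(sample_df, Name,
                   into = c("Lastname", "Salutation", "Firstname"),
                   sep = "[,.]\\s*", extra = "merge")
print(merged)
```

With `extra = "merge"` the second dot stays inside the Firstname column instead of triggering a split, so the row no longer breaks the three-column layout.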

In short, the simplest way I can see to do what you want is

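library(stringr) # for str_match()
# Capture groups 1-3 (matrix columns 2-4) are last name, salutation, first name
data[c("Lastname", "Salutation", "Firstname")] <-
    str_match(data$Name, "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)")[, 2:4]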
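As a self-contained sketch of the str_match() approach, the snippet below runs the same regular expression on a small made-up data frame, with the new column names listed in the same order as the capture groups. The third row is hypothetical and deliberately does not fit the "[Lastname], [Salutation]. [Firstname]" form, to show that str_match() returns NA for such rows; the `unmatched` check is an addition for illustration, not part of the original answer.

```r
library(stringr)

# Made-up sample; the last row has a two-word salutation on purpose
data <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Rothschild, Mrs. Martin (Elizabeth L. Barrett)",
           "Doe, the Countess. Jane"),
  stringsAsFactors = FALSE
)

pattern <- "([A-Za-z]*),\\s([A-Za-z]*)\\.\\s(.*)"

# Column 1 of the result is the full match; columns 2-4 are the groups
parts <- str_match(data$Name, pattern)

# Names are given in capture-group order: last name, salutation, first name
data[c("Lastname", "Salutation", "Firstname")] <- parts[, 2:4]

# Rows the pattern could not match come back as NA in every column
unmatched <- data$Name[is.na(parts[, 1])]

print(data)
print(unmatched)
```

Checking `unmatched` on the real data before relying on the new columns may be worthwhile, since any name whose salutation contains a space will not match this pattern.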