Extracting characters from a string

Date: 2014-01-26 22:31:18

Tags: r loops if-statement grep

The structure of the dataset is:

> str(trainData)
'data.frame':   891 obs. of  13 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
 $ Pclass     : Factor w/ 3 levels "1st","2nd","3rd": 3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : Factor w/ 2 levels "Male","Female": 1 2 2 2 1 1 1 1 2 2 ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : int  NA NA NA 113803 373450 330877 17463 349909 347742 237736 ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...
 $ Area       : Factor w/ 9 levels "","A","B","C",..: 1 4 1 4 1 1 6 1 1 1 ...

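I want to create a new column in the data frame that stores the form of address contained in the Name variable. To do that, I need to extract the strings "Mr", "Mrs", and so on, and store them in a new vector. I tried to solve the problem in the following way.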

vec <- vector()

for (i in 1 : nrow(trainData)) {
  if (grep("Mr\\.", trainData[i, "Name"]) == 1) {vec[i] <- "Mr"}
  else if (grep("Miss\\.", trainData[i, "Name"]) == 1) {vec[i] <- "Miss"}
  else if (grep("Mrs\\.", trainData[i, "Name"]) == 1) {vec[i] <- "Mrs"}
  else if (grep("Don\\.", trainData[i, "Name"]) == 1) {vec[i] <- "Don"}
  else if (grep("Master\\.", trainData[i, "Name"]) == 1) {vec[i] <- "Master"}
  else {vec[i] <- "Boh"}
}

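...and then use the cbind function to bind the existing data frame to the new column, FormOfAddress. I have not tested the next two lines of code, because I get an error message from the previous block.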

trainData <- as.data.frame(cbind(trainData, vec))
names(trainData)[length(trainData)] <- "FormOfAddress"

Basically, this is where I am stuck:

> vec <- vector()
> for (i in 1 : nrow(trainData)) {
+ if (grep("Mr\\.", trainData[i, c("Name")]) == 1) {vec[i] <- "Mr"}
+ else if (grep("Miss\\.", trainData[i, c("Name")]) == 1) {vec[i] <- "Miss"}
+ else if (grep("Mrs\\.", trainData[i, c("Name")]) == 1) {vec[i] <- "Mrs"}
+ else if (grep("Don\\.", trainData[i, c("Name")]) == 1) {vec[i] <- "Don"}
+ else if (grep("Master\\.", trainData[i, c("Name")]) == 1) {vec[i] <- "Master"}
+ else {vec[i] <- "Boh"; next}
+ }
Error in if (grep("Mr\\.", trainData[i, c("Name")]) == 1) { : 
  argument is of length zero

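The first part of the if statement looks right to me: when the string Mr. is contained in the name, it returns TRUE. The second part also looks fine (at least on the first iteration) and writes the string Mr into the vector vec. I think the problem is in the second iteration, but I cannot find a way to make it work.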

1 Answer:

Answer 0 (score: 0)

trainData$Name

## [1] "Braund, Mr. Owen Harris"                            
## [2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
## [3] "Heikkinen, Miss. Laina"                             
## [4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
## [5] "tt"                                                 
## [6] "Mr. Jones"                                          

for (x in trainData$Name) {
    print(grep("Mr\\.", x))
    print(grepl("Mr\\.", x));
}

## [1] 1
## [1] TRUE
## integer(0)
## [1] FALSE
## integer(0)
## [1] FALSE
## integer(0)
## [1] FALSE
## integer(0)
## [1] FALSE
## [1] 1
## [1] TRUE
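
The output above suggests why the original loop stops on the second row: grep() returns integer(0) when nothing matches, so the comparison with 1 has length zero and if() cannot evaluate it, whereas grepl() always returns a single TRUE or FALSE. A minimal sketch of the loop rewritten with grepl() follows; `names_sample` and `titles` are hypothetical names introduced here for illustration, standing in for trainData$Name and the titles tested in the question.

```r
# Hypothetical sample standing in for trainData$Name
names_sample <- c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
                  "Heikkinen, Miss. Laina",
                  "tt")

# grep() gives integer(0) on no match, so this comparison has length 0
# and cannot be used as an if() condition
length(grep("Mr\\.", "Heikkinen, Miss. Laina") == 1)

# Titles checked in the same order as in the question
titles <- c("Mr", "Miss", "Mrs", "Don", "Master")

vec <- character(length(names_sample))
for (i in seq_along(names_sample)) {
  vec[i] <- "Boh"                       # default when no title matches
  for (title in titles) {
    # grepl() returns exactly one TRUE/FALSE per input string
    if (grepl(paste0(title, "\\."), names_sample[i])) {
      vec[i] <- title
      break
    }
  }
}
vec
```

With the sample above, `vec` should come out as "Mr", "Mrs", "Miss", "Boh".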

## Doing it without a loop.
## You might have to come up with a different
## regex here depending on the rest of your data
vec <- gsub("^([^,]+, )?([^.]+).*", "\\2", trainData$Name)
## [1] "Mr"   "Mrs"  "Miss" "Mrs"  "tt"   "Mr"  
vec <- ifelse(vec == trainData$Name, "Boh", vec)
## [1] "Mr"   "Mrs"  "Miss" "Mrs"  "Boh"  "Mr"
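
As a possible follow-up to the cbind step that the question left untested, the extracted vector can be assigned directly as a new column. This is only a sketch: `df` and `title` are hypothetical names, and `df` is a stand-in built from the six names shown above rather than the real trainData.

```r
# Hypothetical stand-in for trainData, built from the names printed above
df <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina",
           "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
           "tt",
           "Mr. Jones"),
  stringsAsFactors = FALSE
)

# Same regex as above: drop an optional "Surname, " prefix and keep
# everything up to the first period
title <- gsub("^([^,]+, )?([^.]+).*", "\\2", df$Name)

# Names with no period come back unchanged; label those "Boh",
# then store the result as a new column without cbind()
df$FormOfAddress <- ifelse(title == df$Name, "Boh", title)

# Quick check of how often each form of address occurs
table(df$FormOfAddress)
```

Assigning with `$` names the column in the same step, so the separate names() call from the question should not be needed in this variant.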