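How to match a person's title using a regular expression

Time: 2019-02-07 16:49:52

Tags: r grepl

I want to match titles using a regular expression. Write an R snippet that creates a new column named "Female" and fills it with TRUE/FALSE values based on the text in the "Name" column. For example, "Miss" should give TRUE, and a name with no title at all should give NA.

Here is the data frame: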

df <- data.frame(PersonID=1:8, Name=c("Mr. Bob", "Ms. Blank", "Roger, Mr.", "MR Mark Simpson", "Miss Lisa", "Mrs. joshep", "Rakesh Kumar", "Kumar Gums Murphy"))

grepl("Miss", df, perl=TRUE)

Output:

FALSE,FALSE,FALSE

Expected output:

FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,NA,NA

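Can anyone help me?

1 Answer:

Answer 0: (score: 1)

If you want NA to mean "no title given", you first have to rule out every other title. In other words, just because "Miss" is absent does not mean that "Mr" or "MISS" is absent too.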
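To illustrate the point, here is a minimal sketch using the data frame from the question. It searches the Name column rather than the whole data frame; as far as I know, grepl() coerces a data frame with as.character(), which yields one string per column instead of one per row, so the original call cannot return eight values. The variable name miss is only for illustration.

```r
df <- data.frame(
  PersonID = 1:8,
  Name = c("Mr. Bob", "Ms. Blank", "Roger, Mr.", "MR Mark Simpson",
           "Miss Lisa", "Mrs. joshep", "Rakesh Kumar", "Kumar Gums Murphy"),
  stringsAsFactors = FALSE
)

# Search the Name column, not the whole data frame
miss <- grepl("Miss", df$Name, perl = TRUE)
miss
# [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

# "Mr. Bob" (row 1) and "Rakesh Kumar" (row 7) are both FALSE:
# a single grepl() call cannot tell "male title" from "no title"
miss[c(1, 7)]
```

A single pattern only ever gives TRUE or FALSE, so the NA case needs a second check for "any title at all".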

For your example, the following assigns "M", "F", or NA. Add more titles as needed.

Titles <- c("Miss", "Ms","Mr","Mrs","MR","MS","MRS","MISS") # vector of possible titles
f.Titles <- c("Miss", "Ms","Mrs","MS","MRS","MISS") # vector of female specific titles
check <- NULL
for (i in seq_along(Titles)) {
  # append one logical column per title: TRUE where that title occurs in Name
  check <- cbind(check, grepl(Titles[i], df$Name, perl=TRUE))
}

colnames(check) <- Titles
apply(check,1,function(x)ifelse(!any(x),NA,
                                ifelse(any(names(which(x)) %in% f.Titles),"F","M")))

Output:

[1] "M" "F" "M" "M" "F" "F" NA  NA 
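One detail in this result is worth spelling out: grepl() matches substrings, so "Mrs. joshep" matches both "Mr" and "Mrs". The row still comes out as "F" only because the female titles are tested first. A small sketch of that behaviour (the names name, hits, and result are hypothetical, used only for this illustration):

```r
f.Titles <- c("Miss", "Ms", "Mrs", "MS", "MRS", "MISS")
name <- "Mrs. joshep"

# Both patterns match, because "Mr" is a substring of "Mrs"
hits <- c(Mr  = grepl("Mr",  name, perl = TRUE),
          Mrs = grepl("Mrs", name, perl = TRUE))
hits

# Same rule as in the apply() call: any female title wins
result <- if (any(names(which(hits)) %in% f.Titles)) "F" else "M"
result
# [1] "F"
```

If the order of the two checks were reversed, this row would be labelled "M".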

From there:

G <- apply(check,1,function(x)ifelse(!any(x),NA,
                                     ifelse(any(names(which(x)) %in% f.Titles),"F","M")))

df$Female <- ifelse(G=="F",TRUE,ifelse(is.na(G),NA,FALSE))
df
  PersonID              Name Female
1        1           Mr. Bob  FALSE
2        2         Ms. Blank   TRUE
3        3        Roger, Mr.  FALSE
4        4   MR Mark Simpson  FALSE
5        5         Miss Lisa   TRUE
6        6       Mrs. joshep   TRUE
7        7      Rakesh Kumar     NA
8        8 Kumar Gums Murphy     NA

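Edit 1:

Here is a more compact version that does exactly what you asked for. You still need to specify all possible titles (Titles) and the female titles (f.Titles).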

# one column per title, named after it: TRUE where the title occurs in Name
check <- sapply(Titles, function(x) grepl(x, df$Name, perl=TRUE))
# NA when no title matched; otherwise TRUE if any matched title is a female one
df$Female <- apply(check, 1, function(x) if (!any(x)) NA else any(names(which(x)) %in% f.Titles))
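As a possible alternative, not part of the approach above, the case variants and the substring problem could be handled in the pattern itself. This is only a sketch under a few assumptions: titles are standalone words, "Miss", "Ms", and "Mrs" are the only female titles, and "Mr" is the only male one. The names female_pat, male_pat, is_female, and is_male are made up for the example.

```r
df <- data.frame(
  PersonID = 1:8,
  Name = c("Mr. Bob", "Ms. Blank", "Roger, Mr.", "MR Mark Simpson",
           "Miss Lisa", "Mrs. joshep", "Rakesh Kumar", "Kumar Gums Murphy"),
  stringsAsFactors = FALSE
)

# \\b is a word boundary, so "Mr" no longer matches inside "Mrs"
# and "ms" no longer matches inside "Gums"
female_pat <- "\\b(Miss|Ms|Mrs)\\b"
male_pat   <- "\\bMr\\b"

# ignore.case = TRUE covers "MR", "MISS", etc. without listing them
is_female <- grepl(female_pat, df$Name, ignore.case = TRUE, perl = TRUE)
is_male   <- grepl(male_pat,   df$Name, ignore.case = TRUE, perl = TRUE)

# TRUE for a female title, FALSE for a male title, NA when neither matched
df$Female <- ifelse(is_female, TRUE, ifelse(is_male, FALSE, NA))
df$Female
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE    NA    NA
```

The trade-off is that any new title (for example "Dr", which carries no gender) still has to be added to one of the patterns or handled separately.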