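Background
I am trying to use the humaniformat package to parse a long list of full names into separate first and last names. The problem is that many of the names end with credentials that the package does not recognize, so it mistakenly identifies a credential as the last name.
Question
How can I remove all credentials from a known set from the end of each name?
The names are formatted as follows: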
df<-data.frame(Name = c("ADAM WEST RN CDE",
"KEVIN CONROY RDN LD CDE",
"VAL KILMER RN CNS",
"CHRISTIAN BALE RN CDE",
"MICHAEL KEATON BS MED PHD"))
ADAM WEST RN CDE
KEVIN CONROY RDN LD CDE
VAL KILMER RN CNS
CHRISTIAN BALE RN CDE
MICHAEL KEATON BS MED PHD
The result I am looking for:
Fixed_Name
ADAM WEST
KEVIN CONROY
VAL KILMER
CHRISTIAN BALE
MICHAEL KEATON
I have tried the following, but it removes only the first credential listed at the end of each name and leaves the rest.
df$Fixed_Name<-gsub(" RN[^A-Z]| CDE[^A-Z]| LD[^A-Z]| RDN[^A-Z]| CNS[^A-Z]
| K M[^A-Z]| DO[^A-Z]| PA[^A-Z]| MS[^A-Z]| MSN[^A-Z]
| BS[^A-Z]| RPH[^A-Z]| MED[^A-Z]| CDE[^A-Z]
| BS[^A-Z]| MED[^A-Z]| PHD[^A-Z]"," ",df$Name)
Answer 0: (score: 3)
Something like this:
rex <- "( (RN|CDE|LD|RDN|CNS|K M|DO|PA|MS|MSN|BS|RPH|MED|PHD))*$"
df$Fixed_Name<-gsub(rex,"",df$Name)
df
#                        Name     Fixed_Name
# 1          ADAM WEST RN CDE      ADAM WEST
# 2   KEVIN CONROY RDN LD CDE   KEVIN CONROY
# 3         VAL KILMER RN CNS     VAL KILMER
# 4     CHRISTIAN BALE RN CDE CHRISTIAN BALE
# 5 MICHAEL KEATON BS MED PHD MICHAEL KEATON
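Here we look for <space>title (a space followed by one of the known credentials) repeated zero or more times at the end of the string, and remove it.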
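If the credential list is long or changes often, the same pattern can be assembled from a character vector instead of being typed by hand. The following is only a sketch of that idea: the names `credentials` and `strip_credentials` are hypothetical, and the vector simply repeats the credentials from the question. It uses `+` rather than `*` so the pattern only matches when at least one credential is present, and `sub()` because the `$` anchor allows a single match per string.

```r
# Hypothetical vector of known credentials (copied from the question's list)
credentials <- c("RN", "CDE", "LD", "RDN", "CNS", "K M", "DO", "PA",
                 "MS", "MSN", "BS", "RPH", "MED", "PHD")

# One or more " <credential>" groups anchored at the end of the string
rex <- paste0("( (", paste(credentials, collapse = "|"), "))+$")

# Hypothetical helper: drop the trailing run of credentials, if any
strip_credentials <- function(x) sub(rex, "", x)

strip_credentials(c("ADAM WEST RN CDE",
                    "MICHAEL KEATON BS MED PHD",
                    "VAL KILMER"))
# [1] "ADAM WEST"      "MICHAEL KEATON" "VAL KILMER"
```

One caveat with any approach of this kind: a surname that happens to equal a credential (for example a name ending in " DO") would also be stripped, so the result is worth spot-checking.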
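Answer 1: (score: 2)
You can add the metacharacter (.*) after each credential so that the match extends to the end of the string and takes the remaining credentials with it. Here is the solution: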
> df<-data.frame(Name = c("ADAM WEST RN CDE",
+ "KEVIN CONROY RDN LD CDE",
+ "VAL KILMER RN CNS",
+ "CHRISTIAN BALE RN CDE",
+ "MICHAEL KEATON BS MED PHD"))
>
> df$Fixed_Name<-gsub(" RN[^A-Z](.*)| CDE[^A-Z](.*)| LD[^A-Z](.*)| RDN[^A-Z](.*)| CNS[^A-Z](.*)
+ | K M[^A-Z](.*)| DO[^A-Z](.*)| PA[^A-Z](.*)| MS[^A-Z](.*)| MSN[^A-Z](.*)
+ | BS[^A-Z](.*)| RPH[^A-Z](.*)| MED[^A-Z](.*)| CDE[^A-Z](.*)
+ | BS[^A-Z](.*)| MED[^A-Z](.*)| PHD[^A-Z](.*)"," ",df$Name)
> df
Name Fixed_Name
1 ADAM WEST RN CDE ADAM WEST
2 KEVIN CONROY RDN LD CDE KEVIN CONROY
3 VAL KILMER RN CNS VAL KILMER
4 CHRISTIAN BALE RN CDE CHRISTIAN BALE
5 MICHAEL KEATON BS MED PHD MICHAEL KEATON
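Two details of this approach are easy to miss in the printed output, so here is a small check using a shortened version of the pattern (two alternatives only, for illustration). Because the replacement is " ", the result keeps a trailing space, and because `[^A-Z]` has to consume a character, a name that ends in a single credential is not matched at all. The variable names and the "JOHN SMITH RN" input are made up for this sketch.

```r
# Shortened version of the pattern above, for illustration only
pat <- " RN[^A-Z](.*)| CDE[^A-Z](.*)"

fixed <- gsub(pat, " ", "ADAM WEST RN CDE")
nchar(fixed)    # 10: "ADAM WEST" (9 characters) plus one trailing space
trimws(fixed)   # "ADAM WEST"

# A single trailing credential has no character after it for [^A-Z] to match
single <- gsub(pat, " ", "JOHN SMITH RN")
single          # "JOHN SMITH RN" (unchanged)
```

Wrapping the result in `trimws()` handles the first point; the second is avoided by anchoring on `$`, as in the other answer.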