I have a list of owner names in all caps that I'd like to convert to proper capitalization:
owner1
1: DXXXXX JOSEPH V JR
2: MIRNA NXXXXX
3: ADRIAN TXXXX
4: CUTLER PXXXXXXXXX LLC
5: GVM PXXXXXXXXX LLC
6: EARLENA RXXXXXXX
7: NATHANIEL TXXXXX
8: DXXXXXX DONNA
9: LXXXX ELAINE E TR
10: SXXXXXX KIMBERLY
(For copy-paste purposes:
owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
"CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
"EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
"LXXXX ELAINE E TR","SXXXXXX KIMBERLY")
)
Desired output:
owner1
1: Dxxxxx Joseph V. Jr
2: Mirna Nxxxxx
3: Adrian Txxxx
4: Cutler Pxxxxxxxxx LLC
5: GVM Pxxxxxxxxx LLC
6: Earlena Rxxxxxxx
7: Nathaniel Txxxxx
8: Dxxxxxx Donna
9: Lxxxx Elaine E. TR
10: Sxxxxxx Kimberly
An important first step is the .simpleCap function mentioned in ?chartr:
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
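This handles most of the problem, but it fails on observations 4, 5, and 9. I could supplement it to handle key phrases (LLC, TR, etc.) separately, but that still leaves cases like observation 5.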
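To make the failures concrete, here is a minimal sketch that applies .simpleCap to just those three observations. The names problem_cases and simple_result are introduced only for this illustration.

```r
# .simpleCap as given in ?chartr (handles one string at a time)
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Observations 4, 5 and 9 from the data above
problem_cases <- c("CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC",
                   "LXXXX ELAINE E TR")

# vapply applies the one-string function to each element in turn
simple_result <- vapply(problem_cases, .simpleCap, character(1),
                        USE.NAMES = FALSE)
print(simple_result)
# "Cutler Pxxxxxxxxx Llc", "Gvm Pxxxxxxxxx Llc", "Lxxxx Elaine E Tr"
```

Every word is treated the same way, so LLC, GVM and TR lose their capitals and the initial E gets no period.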
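Here is the function I've been using so far (@eipi10's solution below very neatly sped up the .simpleCap step, which allows the whole function to be applied to a vector):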
to.proper<-function(strings){
  #vectorized version of .simpleCap;
  # I've also built in that I know `strings` is all caps
  res<-gsub("\\b([A-Z])([A-Z]+)*","\\U\\1\\L\\2",strings,perl=TRUE)
  #In my data, some Irish/Scottish names separated the MC prefix
  # Also, re-capitalize following a hyphen
  res<-gsub("\\bMc\\s","Mc",gsub("(-.)","\\U\\1",res,perl=TRUE))
  for (init in c("[A-Z]","Inc","Assoc","Co",
                 "Jr","Sr","Tr","Bros")){
    #Add a period after common abbreviations
    res<-gsub(paste0("\\b(",init,")\\b"),"\\1.",res)
  }
  for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                 "Pa","Ii","Iii","Iv","Lp","Tj",
                 "Xiv","Ll","Yml","Us")){
    #Re-capitalize any string of >=3 consonants (excluding
    # Y for such names as LYNN and WYNN), as well as
    # some other common phrases that need upper-casing
    res<-gsub(paste0("\\b(",abbr,")\\b"),"\\U\\1",res,perl=TRUE)
  }
  #Re-capitalize post-Mc letters, e.g. in Mcmahon
  gsub("\\bMc([a-z])","Mc\\U\\1",res,perl=TRUE)
}
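The function relies on Perl-style case modifiers in the replacement string. As a minimal sketch, here are its first and last steps in isolation; the variable names step1 and mc_fixed are introduced only for this illustration.

```r
# First step: upper-case the first letter of each word, lower-case the rest.
# \\U and \\L are Perl-style case modifiers, so perl = TRUE is required.
step1 <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2",
              c("MIRNA NXXXXX", "CUTLER PXXXXXXXXX LLC"), perl = TRUE)
print(step1)
# "Mirna Nxxxxx", "Cutler Pxxxxxxxxx Llc"

# Last step: re-capitalize the letter that follows a leading "Mc"
mc_fixed <- gsub("\\bMc([a-z])", "Mc\\U\\1", "Mcmahon", perl = TRUE)
print(mc_fixed)
# "McMahon"
```

Because gsub is vectorized over its input, no explicit loop over the names is needed for these steps.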
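Any suggestions for a robust approach, in particular one that leaves potentially unpredictable abbreviations alone in the process (especially uncommon ones like the one in observation 5)?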
Answer 0 (score: 2)
Here is a function that uses regular expressions to convert strings to title case (adapted from @BenBolker's answer to a question I asked on SO a while back).
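The function is written so that you can pass an argument called exceptions to handle special cases like GVM. I'm not sure this is flexible enough for your needs, since you have to hard-code the exceptions, but I thought I'd post it and see if anyone can suggest improvements.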
dat = data.frame(owner1 = c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
                            "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
                            "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
                            "LXXXX ELAINE E TR","SXXXXXX KIMBERLY"))
# Convert a string to title case
tc = function(strings, exceptions="\\b(gvm)\\b") {
  # Convert to title case, keeping a following LLC, TR, etc. in upper case.
  # The \\b after the suffix stops " TR" from matching the start of a
  # longer word (e.g. "TRAVIS").
  title.case = gsub("\\b([a-zA-Z])([a-zA-Z]+)*( (?:LLC|TR|FBO|LP)\\b)?",
                    "\\U\\1\\L\\2\\U\\3", strings, perl=TRUE)
  # Add a period after initials (presumed to be any lone capital letter
  # with a space on both sides)
  title.case = gsub(" ([A-Z]) ", " \\1\\. ", title.case)
  # Deal with exceptions
  title.case = gsub(exceptions, "\\U\\1", title.case, perl=TRUE, ignore.case=TRUE)
  return(title.case)
}
dat$title.case = tc(dat$owner1)
                  owner1            title.case
1     DXXXXX JOSEPH V JR   Dxxxxx Joseph V. Jr
2           MIRNA NXXXXX          Mirna Nxxxxx
3           ADRIAN TXXXX          Adrian Txxxx
4  CUTLER PXXXXXXXXX LLC Cutler Pxxxxxxxxx LLC
5     GVM PXXXXXXXXX LLC    GVM Pxxxxxxxxx LLC
6       EARLENA RXXXXXXX      Earlena Rxxxxxxx
7       NATHANIEL TXXXXX      Nathaniel Txxxxx
8          DXXXXXX DONNA         Dxxxxxx Donna
9      LXXXX ELAINE E TR    Lxxxx Elaine E. TR
10      SXXXXXX KIMBERLY      Sxxxxxx Kimberly
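Since exceptions is just a regular expression with one capture group, several special cases could be combined with alternation. The sketch below assumes a hypothetical helper, make_exceptions, and an invented acronym, ABC; neither is part of the function above. It repeats a condensed copy of tc so that it runs on its own, with a word boundary after the suffix alternation so that names such as TRAVIS are not affected.

```r
# Condensed copy of tc() so this block is self-contained
tc <- function(strings, exceptions = "\\b(gvm)\\b") {
  title.case <- gsub("\\b([a-zA-Z])([a-zA-Z]+)*( (?:LLC|TR|FBO|LP)\\b)?",
                     "\\U\\1\\L\\2\\U\\3", strings, perl = TRUE)
  title.case <- gsub(" ([A-Z]) ", " \\1\\. ", title.case)
  gsub(exceptions, "\\U\\1", title.case, perl = TRUE, ignore.case = TRUE)
}

# Hypothetical helper: build the exceptions regex from a character vector
make_exceptions <- function(words) {
  paste0("\\b(", paste(words, collapse = "|"), ")\\b")
}

# "abc" is an invented acronym used only for illustration
keep_upper <- make_exceptions(c("gvm", "abc"))
result <- tc(c("GVM PXXXXXXXXX LLC", "ABC HOLDINGS LLC"),
             exceptions = keep_upper)
print(result)
# "GVM Pxxxxxxxxx LLC", "ABC Holdings LLC"
```

The exceptions still have to be listed by hand, but keeping them in a vector makes the list easier to maintain than editing the regular expression directly.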