Question

(In R) How can I split a title-case string such as "WeLiveInCA" into "We Live In CA" without splitting abbreviations?

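I know how to split a string at every uppercase letter, but doing so breaks initialisms/abbreviations such as CA, USSR, or even U.S.A., and I need to preserve them.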

So I'm thinking of some kind of logic along the lines of "if a word in a string isn't an initialism, then split the word with a space where a lowercase character is followed by an uppercase character".

My snippet below inserts a space before every capital letter, but it breaks the initialism CA into C A.

s <- "WeLiveInCA"
trimws(gsub('([[:upper:]])', ' \\1', s))
# "We Live In C A"

Or another example...

s <- c("IDon'tEatKittensFYI", "YouKnowYourABCs")
trimws(gsub('([[:upper:]])', ' \\1', s))
# "I Don't Eat Kittens F Y I" "You Know Your A B Cs"

My desired result is:

"We Live In CA"
#
"I Don't Eat Kittens FYI" "You Know Your ABCs"

But this needs to be broadly applicable (not just to my examples).

Answer 1

Try base R gregexpr/regmatches.

s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")
regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
#[[1]]
#[1] "We"   "Live" "In"   "CA"  
#
#[[2]]
#[1] "IDon't"  "Eat"     "Kittens" "FYI"    
#
#[[3]]
#[1] "You"  "Know" "Your" "ABCs"

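Explanation.

[[:upper:]]+ matches one or more uppercase letters;
[^[:upper:]]* matches zero or more occurrences of anything other than an uppercase letter.
Applied in sequence, these two patterns match a run of uppercase letters followed by everything up to the next uppercase letter. Because a whole run of capitals is consumed at once, initialisms such as CA and FYI stay intact, but a one-letter word immediately followed by a capitalized word also stays glued together, as in "IDon't" above.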
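
The answer returns a list of pieces rather than the space-separated strings the question asks for. A minimal sketch of two ways to get single strings, assuming base R only: the variable names `pieces`, `joined`, and `spaced` are illustrative and not part of the original answer, and the second variant is a direct rendering of the rule described in the question (insert a space wherever a lowercase letter is followed by an uppercase letter), not the answerer's method.

```r
s <- c("WeLiveInCA", "IDon'tEatKittensFYI", "YouKnowYourABCs")

# Variant 1: paste the pieces found by the answer's pattern back together.
pieces <- regmatches(s, gregexpr('[[:upper:]]+[^[:upper:]]*', s))
joined <- vapply(pieces, paste, character(1), collapse = " ")
joined
# "We Live In CA" "IDon't Eat Kittens FYI" "You Know Your ABCs"

# Variant 2 (assumption: the rule from the question): insert a space at every
# position preceded by a lowercase letter and followed by an uppercase letter.
# Lookarounds require perl = TRUE.
spaced <- gsub("(?<=[[:lower:]])(?=[[:upper:]])", " ", s, perl = TRUE)
spaced
# "We Live In CA" "IDon't Eat Kittens FYI" "You Know Your ABCs"
```

Both variants give the same result on these inputs, including the unsplit "IDon't". Case alone cannot distinguish "I" + "Don't" from an initialism followed by lowercase letters such as "ABCs", so separating the former without breaking the latter would need extra information, for example a list of known one-letter words.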
