Question

I want to change this:

input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR")

into this:

output <- c("Théodore Agrippa d'Aubigné", "Vital d'Audiguier De La Menor")

The only words to modify are those written entirely in uppercase.

Edit:

An edge case in which the first letter of the sequence is not in [A-Z]:

input <- "Philippe Fabre d'ÉGLANTINE"

Answer 1

Here is an alternative solution:

gsub("(?<=\\p{L})(\\p{L}+)", "\\L\\1", input, perl = TRUE)

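I am not trying to compete with the other existing answers. I just solved (or tried to solve) the challenge and am sharing it here because it might be useful to someone, and/or I might get constructive feedback on how to improve it.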

Edit

For some reason, I skipped over

only words that are all uppercase [...]

I think the following handles that requirement better:

gsub("(?<=\\b\\p{Lu})(\\p{Lu}+\\b)", "\\L\\1", input, perl = TRUE)
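To see what the edit changes, the two patterns can be run side by side on a string that contains a mixed-case word. This is only a sketch: the test string and the names `x`, `loose` and `strict` are made up for illustration and do not come from the question.

```r
# Hypothetical test string: one mixed-case word followed by all-caps words
x <- "McDonald DE LA MENOR"

# First pattern: lowercases every letter that follows another letter,
# so the mixed-case word is altered as well
loose <- gsub("(?<=\\p{L})(\\p{L}+)", "\\L\\1", x, perl = TRUE)

# Revised pattern: only touches words made up entirely of uppercase letters
strict <- gsub("(?<=\\b\\p{Lu})(\\p{Lu}+\\b)", "\\L\\1", x, perl = TRUE)

print(loose)   # "Mcdonald De La Menor"
print(strict)  # "McDonald De La Menor"
```

The revised pattern leaves "McDonald" alone because the letter after the leading capital is not uppercase, so the word never qualifies as all caps.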

Answer 2

A general answer that detects any uppercase character, whatever the encoding, would be:

input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR", "Philippe Fabre d'ÉGLANTINE")
gsub("(*UCP)\\b(\\p{Lu})(\\p{Lu}+)\\b", "\\1\\L\\2", input, perl = TRUE)
# [1] "Théodore Agrippa d'Aubigné"    "Vital d'Audiguier De La Menor" "Philippe Fabre d'Églantine"

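Credit goes to @Wiktor-Stribiżew

\p{Lu} matches any Unicode uppercase letter. The second \p{Lu} can be replaced with \w to also allow underscores and digits (and lowercase letters); it gives the same output here.

(*UCP) is not needed to reproduce the result here, but it comes in handy if the encoding of the input strings differs from the native encoding. In Wiktor's words, it makes the pattern "Unicode-aware".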
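The difference between \p{Lu} and \w in the second group only shows up on input that is not purely uppercase letters. A small sketch, assuming a made-up test string (`x`, `strict` and `wordy` are illustrative names, not part of the answer):

```r
# Hypothetical input: an all-caps word plus mixed-case, underscore and digit tokens
x <- "DE McDonald ABC_DEF X2Y"

# Pattern from the answer: the whole word must consist of uppercase letters
strict <- gsub("(*UCP)\\b(\\p{Lu})(\\p{Lu}+)\\b", "\\1\\L\\2", x, perl = TRUE)

# Variant with \w as the second group: any word starting with an uppercase letter
wordy <- gsub("(*UCP)\\b(\\p{Lu})(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE)

print(strict)  # "De McDonald ABC_DEF X2Y"
print(wordy)   # "De Mcdonald Abc_def X2y"
```

So the \w variant is interchangeable on the names in the question, but it is the looser choice if mixed-case words should stay untouched.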

Answer 3

Form two groups, with a word boundary on either side, as in

\b([A-Z])(\w+)\b

and apply tolower to the second group (leaving the first unchanged).
See a demo on regex101.com (mind the modifiers, especially u).
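In base R, one way to read "tolower on the second group" is to extract the matching words and rebuild each one by hand. The following is a sketch under that interpretation, not necessarily what the answer's author had in mind. It uses an ASCII-only sample from the question; note that [A-Z] covers ASCII capitals only, so the É edge case is not handled by this class, and the tail of any mixed-case word is lowercased as well.

```r
# ASCII-only sample taken from the question
x <- "Vital d'AUDIGUIER DE LA MENOR"

# Locate every word that starts with an ASCII capital followed by
# at least one more word character
m <- gregexpr("\\b[A-Z]\\w+\\b", x, perl = TRUE)

# Keep the first character and lowercase the remainder of each matched word
regmatches(x, m) <- lapply(regmatches(x, m), function(w) {
  paste0(substr(w, 1, 1), tolower(substring(w, 2)))
})

print(x)  # "Vital d'Audiguier De La Menor"
```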

As a side note: you still have several questions with answers that you have not accepted yet.

Answer 4

You can also use the snakecase package and explicitly set sep_in = " " so that non-alphanumeric characters such as ' are not removed (the default is sep_in = "[^[:alnum:]]"):

library(snakecase)

input <- c("Théodore Agrippa d'AUBIGNÉ", "Vital d'AUDIGUIER DE LA MENOR")
output <- c("Théodore Agrippa d'Aubigné", "Vital d'Audiguier De La Menor")

to_title_case(input, sep_in = " ")
#> [1] "Théodore Agrippa d'Aubigné"    "Vital d'Audiguier De La Menor"

identical(to_title_case(input, sep_in = " "), output)
#> [1] TRUE

Created on 2019-08-01 by the reprex package (v0.3.0)

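This works because:

With this sep_in setting, snakecase treats special characters as part of the words rather than as separators.
snakecase::to_title_case() first applies snakecase::to_sentence_case(), which separates words with " ", and then wraps the (lowercase) result in tools::toTitleCase(), which does not capitalize the standalone "d", i.e. "d'aubigné" becomes "d'Aubigné".
snakecase always "protects" its output, i.e. it removes output separators (here " ") around non-alphanumeric characters (here '), which would otherwise be messy and probably unintended. (For numeric characters, this behaviour can be adjusted via the numerals argument.)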

Convert uppercase words to title case
