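I am trying to replicate this answer using R regex, restricting it to runs of only 2 or 3 consecutive capitalized words and also allowing for words written entirely in uppercase: Get consecutive capitalized words using regex
The idea is to pull names out of an otherwise messy jumble of words: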
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"
desired_extract
[1] Andrew Smith
[2] Samuel L Jackson
[3] DEREK JETER
[4] MIKE NELSON TROUT
Answer 0 (score: 2)
What you are looking for is the {1,2} quantifier in place of +, which limits the number of repetitions.
([A-Z]+[a-z]*(?=\s[A-Z])(?:\s[A-Z]+[a-z]*){1,2})
Edit: updated so that it also works for words written entirely in uppercase.
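As a minimal sketch of how this pattern might be applied in R: this assumes base R's gregexpr / regmatches with perl = TRUE, since the lookahead needs PCRE. The variable names pattern and matches are illustrative only.

```r
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

# Backslashes are doubled inside an R string literal.
pattern <- "([A-Z]+[a-z]*(?=\\s[A-Z])(?:\\s[A-Z]+[a-z]*){1,2})"

# perl = TRUE is required because the lookahead (?=...) is PCRE syntax.
matches <- regmatches(test_string, gregexpr(pattern, test_string, perl = TRUE))[[1]]
print(matches)
```

Note that {1,2} caps each match at three words but does not by itself reject longer runs: on the test string this pattern also returns "Want Weird Instances", a three-word slice of the four-word run.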
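Answer 1 (score: 0)
Use a PCRE regex with base R regmatches / gregexpr, and use the SKIP-FAIL technique to match and skip chunks of 4 or more capitalized words, keeping only chunks of 2 to 3 capitalized words: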
(*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b
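See the regex demo.
Details
(*UCP) - a PCRE verb that makes \b and \s Unicode-aware
\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b - a word boundary (\b), an uppercase letter followed by 0 or more letters (\p{Lu}\p{L}*, a "capitalized word"), then 3 or more repetitions of 1 or more whitespace characters (\s+) followed by a capitalized word, and a closing word boundary
(*SKIP)(*F) - if this alternative finds a match, discard it and go on looking for the next match from the position where it ended
| - or
\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b - 2 or 3 whitespace-separated capitalized words within word boundaries.
See the R demo online: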
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"
block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"
regex <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")
##regex <- "(*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b"
regmatches(test_string, gregexpr(regex, test_string, perl=TRUE))
Output:
[[1]]
[1] "Andrew Smith" "Samuel L Jackson" "DEREK JETER"
[4] "MIKE NELSON TROUT"
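To see what the (*SKIP)(*F) branch contributes, here is a small sketch comparing the pattern with and without it. The helper name extract_caps and the shortened input string are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper: return all matches of a PCRE pattern in one string.
extract_caps <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# A capitalized word followed by a repeatable group of further capitalized words.
block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"

# Full pattern: runs of 4+ words are matched first and then discarded.
with_skip <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")
# Same pattern without the SKIP-FAIL alternative.
without_skip <- paste0("(*UCP)", block, "{1,2}\\b")

x <- "we Want Weird Instances Where more stuff, like DEREK JETER"

print(extract_caps(x, with_skip))     # the four-word run is skipped entirely
print(extract_caps(x, without_skip))  # a three-word slice of the run is returned
```

Without the first alternative, the {1,2} branch is free to match the first three words of a longer run, which is why the skip step is needed to drop such runs completely.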