Question

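I am trying to replicate this answer with an R regex, restricting it to only 2 or 3 consecutive capitalized words and also allowing for words written entirely in uppercase: Get consecutive capitalized words using regex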

The idea is to extract names from an otherwise messy jumble of words:

    test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

    desired_extract
    [1] Andrew Smith
    [2] Samuel L Jackson
    [3] DEREK JETER
    [4] MIKE NELSON TROUT

Answer 1

What you are looking for is the {1,2} quantifier instead of + to limit the number of repetitions.

([A-Z]+[a-z]*(?=\s[A-Z])(?:\s[A-Z]+[a-z]*){1,2})

Edit: updated so that it also works for words written entirely in uppercase.
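As a minimal sketch of how this pattern could be applied in base R: the lookahead `(?=...)` is PCRE syntax, so `perl = TRUE` is assumed here, and the variable names `pattern` and `matches` are only illustrative. Backslashes are doubled because the pattern sits inside an R string literal.

```r
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

# Answer 1's pattern, with backslashes doubled for the R string literal
pattern <- "([A-Z]+[a-z]*(?=\\s[A-Z])(?:\\s[A-Z]+[a-z]*){1,2})"

# gregexpr() finds every match; regmatches() extracts the matched text.
# [[1]] picks the result for the first (and only) input string.
matches <- regmatches(test_string, gregexpr(pattern, test_string, perl = TRUE))[[1]]
print(matches)
```

Note that {1,2} only caps how long a single match can be. It does not reject longer runs of capitalized words, so on this test string the pattern also appears to return "Want Weird Instances" (the first three words of the longer run) alongside the four desired names.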

Answer 2

Use a PCRE regex with base R regmatches / gregexpr, and use the SKIP-FAIL technique to match and skip chunks of 4 or more capitalized words, keeping only chunks of 2 to 3 capitalized words:

(*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b

See the regex demo.

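Details

(*UCP) - a PCRE verb that makes \b and \s Unicode-aware
\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b - a word boundary (\b), an uppercase letter followed by 0+ letters of any case (\p{Lu}\p{L}*, a "capitalized word"), then 3 or more repetitions of 1+ whitespace characters (\s+) followed by a capitalized word, then a word boundary
(*SKIP)(*F) - if this alternative found a match, discard it and go on looking for the next match from the position where the discarded one ended
| - or
\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b - 2 or 3 whitespace-separated capitalized words between word boundaries.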

See the R demo online:

test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"
block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"
regex <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")
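## The assembled pattern, shown with single backslashes for readability (they must be doubled inside an R string):
## (*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b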
regmatches(test_string, gregexpr(regex, test_string, perl=TRUE))

Output:

[[1]]
[1] "Andrew Smith"      "Samuel L Jackson"  "DEREK JETER"      
[4] "MIKE NELSON TROUT"
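To see what the (*SKIP)(*F) branch contributes, here is a small sketch that runs the pattern with and without it. The helper `extract` and the names `keep_only` and `with_skip` are hypothetical, introduced only for this comparison; the patterns themselves are built from the same `block` as in the demo.

```r
test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

# A capitalized word followed by the (not yet quantified) "whitespace + capitalized word" group
block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"

# Only the "keep" alternative: 2 or 3 capitalized words
keep_only <- paste0("(*UCP)", block, "{1,2}\\b")
# Full pattern: first consume and discard runs of 4+ words, then keep runs of 2 or 3
with_skip <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")

# Hypothetical helper: all matches of a PCRE pattern in a single string
extract <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# Matches that appear only when the SKIP-FAIL branch is left out
print(setdiff(extract(keep_only, test_string), extract(with_skip, test_string)))
```

Without the first alternative, the {1,2} branch still matches the first three words of "Want Weird Instances Where". With it, the whole four-word run is consumed and discarded, so no part of it is returned.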

R regex to identify 2 or 3 consecutive capitalized words in a string [R]
