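R regex to identify 2 or 3 consecutive capitalized words in a string

Asked: 2019-06-21 16:55:17

Tags: r regex

I am trying to replicate this answer using R regex, limiting it to runs of only 2 or 3 consecutive capitalized words while also accounting for words that are entirely uppercase: Get consecutive capitalized words using regex

The idea is to extract names from an otherwise messy jumble of words: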

    test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

    desired_extract
    [1] Andrew Smith
    [2] Samuel L Jackson
    [3] DEREK JETER
    [4] MIKE NELSON TROUT

2 answers:

Answer 0: (score: 2)

What you are looking for is the {1,2} quantifier instead of +, which limits the number of repetitions.

    ([A-Z]+[a-z]*(?=\s[A-Z])(?:\s[A-Z]+[a-z]*){1,2})

Edit: updated so that it also works for words written entirely in uppercase.
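As a quick check, here is a minimal sketch of how this pattern could be applied in R. It assumes the question's `test_string`; the `pattern` and `matches` names are only illustrative. Backslashes are doubled for the R string literal, and `perl = TRUE` is used because the lookahead `(?=...)` is PCRE syntax.

```r
# Answer 0's pattern, with each backslash doubled for an R string literal.
pattern <- "([A-Z]+[a-z]*(?=\\s[A-Z])(?:\\s[A-Z]+[a-z]*){1,2})"

test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"

# gregexpr() finds every match; regmatches() extracts the matched text.
matches <- regmatches(test_string, gregexpr(pattern, test_string, perl = TRUE))[[1]]
print(matches)
```

Note that {1,2} only caps the number of repetitions; it does not reject longer runs. In this sketch the run "Want Weird Instances Where" therefore still contributes "Want Weird Instances" alongside the four desired names.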

Answer 1: (score: 0)

Use a PCRE regex with base R regmatches / gregexpr and the SKIP-FAIL technique to match and skip blocks of 4 or more capitalized words, keeping only blocks of 2 to 3 capitalized words:

    (*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b

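See the regex demo.

Details

  • (*UCP) - a PCRE verb that makes \b and \s Unicode-aware
  • \b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b - a word boundary (\b), an uppercase letter followed by 0+ letters of any case (\p{Lu}\p{L}*, a "capitalized word"), then 3 or more repetitions of 1+ whitespace characters (\s+) followed by a capitalized word, and a closing word boundary; together this matches a run of 4 or more capitalized words
  • (*SKIP)(*F) - if this alternative matched, discard the match and resume searching from the position where it ended
  • | - or
  • \b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b - 2 or 3 whitespace-separated capitalized words, enclosed in word boundaries.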
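To illustrate what the (*SKIP)(*F) alternative contributes, here is a small sketch that runs the pattern with and without it on a shortened sample string. The names `x`, `plain`, `skip_fail`, `without_skip` and `with_skip` are hypothetical and used only for this comparison.

```r
# Shared building block: one capitalized word plus a repeatable
# "whitespace + capitalized word" group (quantifier appended below).
block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"

# Only the second alternative: runs of 2 or 3 capitalized words.
plain <- paste0("(*UCP)", block, "{1,2}\\b")

# Full pattern: first consume and discard runs of 4+, then match runs of 2 or 3.
skip_fail <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")

x <- "we Want Weird Instances Where more stuff, like DEREK JETER"

without_skip <- regmatches(x, gregexpr(plain, x, perl = TRUE))[[1]]
with_skip <- regmatches(x, gregexpr(skip_fail, x, perl = TRUE))[[1]]

print(without_skip)  # still picks three words out of the four-word run
print(with_skip)     # the four-word run is skipped entirely
```

Without the first alternative, the engine takes the first three words of the longer run; with it, the whole run is consumed and thrown away, so only "DEREK JETER" remains.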

See the R demo online:

    test_string <- "we need a test for Andrew Smith or other names like Samuel L Jackson, but we Don't Want Weird Instances Where more stuff is capitalized, but we do want where the entire name is capitalized, like DEREK JETER or MIKE NELSON TROUT"
    # One capitalized word plus the repeatable "whitespace + capitalized word" group
    block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"
    # Skip runs of 4+ capitalized words, then match runs of 2 or 3
    regex <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")
    ## The resulting pattern, shown without R string escaping:
    ## (*UCP)\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){3,}\b(*SKIP)(*F)|\b\p{Lu}\p{L}*(?:\s+\p{Lu}\p{L}*){1,2}\b
    # perl=TRUE selects the PCRE engine, which the verbs and \p{..} classes need
    regmatches(test_string, gregexpr(regex, test_string, perl=TRUE))

Output:

    [[1]]
    [1] "Andrew Smith"      "Samuel L Jackson"  "DEREK JETER"      
    [4] "MIKE NELSON TROUT"
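If the same extraction has to run over several strings, the demo could be wrapped in a small helper. This is only a sketch: `extract_names` is a hypothetical name, not part of base R, and it relies on regmatches / gregexpr being vectorized over the input.

```r
# Hypothetical helper wrapping the SKIP-FAIL pattern from the answer above.
extract_names <- function(x) {
  block <- "\\b\\p{Lu}\\p{L}*(?:\\s+\\p{Lu}\\p{L}*)"
  regex <- paste0("(*UCP)", block, "{3,}\\b(*SKIP)(*F)|", block, "{1,2}\\b")
  # Returns a list with one character vector of matches per input string.
  regmatches(x, gregexpr(regex, x, perl = TRUE))
}

res <- extract_names(c(
  "call Andrew Smith now",
  "no names here",
  "ask DEREK JETER or MIKE NELSON TROUT"
))
print(res)
```

Strings without any qualifying run yield an empty character vector, so the result always has the same length as the input.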