Given text like this:
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
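I need to extract "this guy", "this other guy", "that guy", "that other guy", and "something else".
So I need to tell the regex to match a sequence of words that occurs between any of the following:
two commas
"particular phrase" and a comma
a comma and "or"
"or" and a space
If this is asking too much of a regex, I would settle for a solution that includes a few unwanted words.
I imagined the code would look something like this (it does not work, since I am a regex newbie):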
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
this_pattern <- "^.*\\b(particular phrase|,|or)\\W(\\w+\\W+)+\\W(,|or).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = T)
Edit:
I am getting closer (this one does run):
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
this_pattern <- "^.*\\b(particular phrase)\\W+(.*)\\W+(,|or).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = T)
#[1] "this guy, this other guy, that guy, that other guy,"
But how do I include the last item, "something else"?
Answer 0 (score: 0)
This is the closest you can get at the moment:
(?:\bparticular phrase\b|\bor\b|,)\s*\b(?!or\b)(\w+(?:[^,.\w]+\w+)*?)(?=\s*(?:,|\bor\b))
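See the regex demo.
Details
(?:\bparticular phrase\b|\bor\b|,) - the phrase particular phrase, the whole word or, or a comma
\s* - 0 or more whitespace characters
\b - a word boundary
(?!or\b) - the next word cannot be or
(\w+(?:[^,.\w]+\w+)*?) - Group 1:
  \w+ - 1 or more word characters
  (?:[^,.\w]+\w+)*? - 0 or more repetitions, as few as possible, of:
    [^,.\w]+ - 1 or more characters other than a comma, a dot, or a word character
    \w+ - 1 or more word characters
(?=\s*(?:,|\bor\b)) - a positive lookahead that requires 0 or more whitespace characters followed by a comma or the whole word or immediately to the right of the current location.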
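To see Group 1 in action from R, here is a minimal sketch that is not part of the original answer. It assumes base R only and reads the capture positions that gregexpr() records when perl = TRUE; the variable names m, starts, lens and items are made up for illustration.

```r
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

# Capture-group form of the pattern: Group 1 holds the item itself
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)(\\w+(?:[^,.\\w]+\\w+)*?)(?=\\s*(?:,|\\bor\\b))"

# With perl = TRUE, gregexpr() attaches "capture.start" and
# "capture.length" matrices (one column per capture group)
m <- gregexpr(pattern, this_txt, perl = TRUE, ignore.case = TRUE)[[1]]
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Cut Group 1 out of the text for every match
items <- substring(this_txt, starts, starts + lens - 1)
print(items)
```

On the sample text this should give the five items without the delimiters that precede them.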
In R (here \K discards the text matched so far, so no capture group is needed):
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)\\K\\w+(?:[^,.\\w]+\\w+)*(?=\\s*(?:,|\\bor\\b))"
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
regmatches(this_txt, gregexpr(pattern, this_txt, perl=TRUE, ignore.case=TRUE))[[1]]
Output:
[1] "this guy" "this other guy"
[3] "that guy" "that other guy"
[5] "something else blah blah blah"
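The sample text always has a comma before "or". As a hedged sketch of how the same idea might behave when that comma is missing, the hypothetical helper extract_items() below uses \K together with a lazy quantifier, so an item can stop right before " or". The helper name and the test sentence are assumptions made for illustration, not part of the question.

```r
# Hypothetical helper: returns every item that follows the phrase,
# a comma, or the word "or" (base R only)
extract_items <- function(txt) {
  pattern <- paste0(
    "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)\\K",
    "\\w+(?:[^,.\\w]+\\w+)*?",   # lazy, so an item can end before " or"
    "(?=\\s*(?:,|\\bor\\b))"     # next: a comma or the whole word "or"
  )
  regmatches(txt, gregexpr(pattern, txt, perl = TRUE, ignore.case = TRUE))[[1]]
}

# Made-up sentence with no comma before "or"
no_comma <- "A particular phrase this guy, that guy or something else, the end."
res <- extract_items(no_comma)
print(res)
```

With the lazy quantifier, "that guy" and "something else" should come back as separate items; a greedy quantifier would be expected to run through " or" and return them as one.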