How do I extract a series of words delimited by commas and by start and end words?

Time: 2019-06-12 16:53:38

Tags: r regex gsub

Given text like this:

this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

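I need to extract "this guy, this other guy, that guy, that other guy, something else".

So I need to tell the regex to match any sequence of words that occurs between any of the following:

two commas

"particular phrase" and a comma

a comma and "or"

"or" and a space

If that is asking too much of a regex, I would settle for a solution that includes some unwanted words.

I thought the code would look something like this (it does not work, since I am a regex newbie):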

this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
this_pattern <- "^.*\\b(particular phrase|,|or)\\W(\\w+\\W+)+\\W(,|or).*$"
gsub(this_pattern, "\\2", this_txt, ignore.case = T)

Edit:

I am getting closer (this one does run):

  this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
  this_pattern <- "^.*\\b(particular phrase)\\W+(.*)\\W+(,|or).*$"
  gsub(this_pattern, "\\2", this_txt, ignore.case = T)
#[1] "this guy, this other guy, that guy, that other guy,"

But how do I include the last item, "something else"?

1 answer:

Answer 0 (score: 0)

This is the closest you can get for now:

(?:\bparticular phrase\b|\bor\b|,)\s*\b(?!or\b)(\w+(?:[^,.\w]+\w+)*?)(?=\s*(?:,|\bor\b))

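See the regex demo.

Details

  • (?:\bparticular phrase\b|\bor\b|,) - the whole word or, the phrase particular phrase, or a comma
  • \s* - 0+ whitespace characters
  • \b - a word boundary
  • (?!or\b) - the next word cannot be or
  • (\w+(?:[^,.\w]+\w+)*?) - Group 1:
    • \w+ - 1+ word characters
    • (?:[^,.\w]+\w+)*? - 0+ repetitions, as few as possible, of
      • [^,.\w]+ - 1+ characters other than a comma, a dot, or a word character
      • \w+ - 1+ word characters
  • (?=\s*(?:,|\bor\b)) - a positive lookahead that requires 0+ whitespace characters followed by either a comma or the whole word or immediately to the right of the current location

R demo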
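To see what Group 1 captures without relying on `\K`, one option is to read the capture offsets that `gregexpr()` attaches to its result when `perl = TRUE`. This is a minimal sketch rather than part of the original answer; the variable names (`txt`, `pat`, `m`, `starts`, `lens`, `items`) are made up for illustration, and the sample text is the one from the question.

```r
# Sample text from the question.
txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."

# The pattern described above, with backslashes doubled for an R string literal.
pat <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)(\\w+(?:[^,.\\w]+\\w+)*?)(?=\\s*(?:,|\\bor\\b))"

# With perl = TRUE, gregexpr() records the start and length of every capture group.
m <- gregexpr(pat, txt, perl = TRUE, ignore.case = TRUE)[[1]]
starts <- attr(m, "capture.start")[, 1]
lens   <- attr(m, "capture.length")[, 1]

# Cut Group 1 out of the text for each match.
items <- substring(txt, starts, starts + lens - 1)
print(items)
```

Because the delimiters sit in a non-capturing group and the right-hand boundary is a lookahead, Group 1 holds only the item text, never the comma or the word or.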

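# \K discards the delimiter matched so far, so only the item itself is returned.
# The lookahead groups both alternatives, as in the pattern explained above.
pattern <- "(?:\\bparticular phrase\\b|\\bor\\b|,)\\s*\\b(?!or\\b)\\K\\w+(?:[^,.\\w]+\\w+)*?(?=\\s*(?:,|\\bor\\b))"
this_txt <- "Blah blah blah particular phrase this guy, this other guy, that guy, that other guy, or something else blah blah blah, blah blah. Blah blah blah, blah; and so blah."
# perl=TRUE is required for \K and for lookarounds.
regmatches(this_txt, gregexpr(pattern, this_txt, perl=TRUE, ignore.case=TRUE))[[1]]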

Output:

[1] "this guy"                      "this other guy"               
[3] "that guy"                      "that other guy"               
[5] "something else blah blah blah"
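The question asks for a single comma-separated string rather than a vector. A minimal sketch of that last step, assuming the five matches shown above are stored in a variable (hypothetically named `found` here):

```r
# The matches from the output above, typed in by hand so the snippet runs on its own.
found <- c("this guy", "this other guy", "that guy", "that other guy",
           "something else blah blah blah")

# Collapse the vector into one comma-separated string.
joined <- paste(found, collapse = ", ")
print(joined)
```

The last item still carries the trailing "blah blah blah", because nothing in the text marks where "something else" ends before the next comma. This is the "some unwanted words" case that the question says is acceptable.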