Regex in R to match strings in square brackets

Time: 2018-05-22 08:16:24

Tags: r regex

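I have transcripts of storytelling that contain many instances of overlapping speech, marked by square brackets. I want to extract these overlapping instances. In the following mock example,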

ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")

this code works fine:

pattern <- "\\[(.*\\w.+])*"
grep(pattern, ovl, value=T) 
matches <- gregexpr(pattern, ovl) 
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap); overlap_clean
[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

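But on a larger file, a data frame, it does not. Is this due to an error in the pattern or to how the data frame is structured? The first six rows of df look like this: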

> head(df)
                                                             Story
1 "Kar:\tMind you our Colin's getting more like your dad every day
2                                             June:\tI know he is.
3                                 Kar:\tblack welding glasses on, 
4                        \tand he turned round and he made me jump
5                                                 \t“O:h, Colin”, 
6                                  \tand then (                  )

2 Answers:

Answer 0: (score: 2)

While it may work in some cases, your pattern looks off to me. I think it should be something like this:

pattern <- "(\\[.*?\\])"
matches <- gregexpr(pattern, ovl)
overlap <- regmatches(ovl, matches)
overlap_clean <- unlist(overlap)
overlap_clean

[1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

This matches and captures the bracketed terms, using a Perl-style lazy dot (.*?) to make sure we stop at the first closing bracket.
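The difference between the greedy and the lazy dot only shows up when a single string contains more than one bracketed stretch. The following is a minimal sketch using a made-up input string (not taken from the asker's data); the capture group from the pattern above is left out because regmatches returns the whole match anyway.

```r
# Hypothetical input with two bracketed stretches in one string
x <- "well [yes] and then [right] okay"

# Greedy: .* runs on to the LAST closing bracket in the string
greedy <- regmatches(x, gregexpr("\\[.*\\]", x))[[1]]

# Lazy: .*? stops at the FIRST closing bracket after each opening bracket
lazy <- regmatches(x, gregexpr("\\[.*?\\]", x))[[1]]

greedy  # "[yes] and then [right]"
lazy    # "[yes]"   "[right]"
```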

Answer 1: (score: 0)

To match a string between [ and ] that has no square brackets in between, use

"\\[[^][]*]"

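Unlike the \[.*?] pattern, it matches only [a] in the string [a[a]. The lazy pattern starts at the first [ and therefore returns the whole [a[a].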
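The [a[a] case can be checked directly. This is a small sketch that runs both patterns on that one string; the variable names are only illustrative.

```r
# A string with a stray opening bracket before the real bracketed term
x <- "[a[a]"

# Lazy dot: the match starts at the first "[" and ends at the first "]"
lazy <- regmatches(x, gregexpr("\\[.*?\\]", x))[[1]]

# Negated bracket expression: no "[" or "]" is allowed inside the match
negated <- regmatches(x, gregexpr("\\[[^][]*]", x))[[1]]

lazy     # "[a[a]"
negated  # "[a]"
```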

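Details

  • \[ - a [ character
  • [^][]* - a negated bracket expression (or character class) that matches 0 or more characters other than [ and ]
  • ] - a ] character (it does not need escaping outside a character class/bracket expression)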

R demo:

ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
unlist(regmatches(ovl, gregexpr("\\[[^][]*]", ovl)))
## => [1] "[yes right]"     "[  we::ll]"      "[°well right° ]"
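The same call can be pointed at a data frame column instead of a plain vector. The sketch below assumes the text lives in a character column named Story, as in the question's head(df); the rows themselves are invented for illustration, and the as.character() call is only a defensive assumption in case the column was read in as a factor. It is not a diagnosis of why the asker's original attempt failed.

```r
# Hypothetical data frame; these rows are made up, not the asker's data
df <- data.frame(
  Story = c("Kar:\tMind you [our Colin's] getting more like your dad",
            "June:\t[I know] he is.",
            "\tand then (   )"),
  stringsAsFactors = FALSE
)

# Work on the column as a character vector, not on the data frame itself
story <- as.character(df$Story)

# One list element per row; rows without brackets give character(0)
hits <- regmatches(story, gregexpr("\\[[^][]*]", story))

overlap_clean <- unlist(hits)
overlap_clean  # "[our Colin's]" "[I know]"
```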

Using stringr::str_extract_all:

library(stringr)
ovl <- c("well [yes right]", "let's go", "oh [  we::ll] i do n't (0.5) know", "erm [°well right° ]", "(3.2)")
unlist(str_extract_all(ovl, "\\[[^\\]\\[]*]"))
## => [1] "[yes right]"     "[  we::ll]"      "[°well right° ]"

Here, because the pattern is processed by the ICU regex library, you need to escape both square brackets inside the character class.