Remove all text between two brackets

Date: 2014-05-31 05:21:54

Tags: regex r stringr

Suppose I have some text like this:

text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")

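I want to remove (edit: delete) all the text between [ and ], along with the brackets themselves. What is the best way to do this? Here is my feeble attempt using a regular expression and the stringr package: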

str_extract(text, "\\[[a-z]*\\]")

Thanks for your help!

5 answers:

Answer 0: (score: 21)

With this:

gsub("\\[[^\\]]*\\]", "", text, perl=TRUE)

What the regular expression means:

  \[                       # '['
  [^\]]*                   # any character except: '\]' (0 or more
                           # times (matching the most amount possible))
  \]                       # ']'
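
As a minimal sketch of the pattern in action, the snippet below applies it to a short made-up string instead of the full transcript. The variable names x and no_labels are assumptions for illustration only.

```r
# Hypothetical short input, not the original ad transcript
x <- "[A]: one. [B]: two."

# \\[ matches "[", [^\\]]* matches any run of characters that are not "]",
# and \\] matches the closing "]"; perl = TRUE selects the PCRE engine
no_labels <- gsub("\\[[^\\]]*\\]", "", x, perl = TRUE)
print(no_labels)
#> [1] ": one. : two."
```

Note that only the bracketed labels are removed; the colon and space after each label remain.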

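Answer 1: (score: 9)

The following should do the trick. The ? forces a lazy match, so .* matches as few characters as possible before the subsequent ].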

gsub('\\[.*?\\]', '', text)
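
To see why the lazy quantifier matters, here is a small sketch comparing the greedy and lazy forms on a made-up string (x, greedy and lazy are illustrative names, not part of the answer):

```r
# Hypothetical short input with two bracketed labels
x <- "[A]: one. [B]: two."

# Greedy: .* runs from the first "[" to the LAST "]",
# so the text between the two labels is removed as well
greedy <- gsub("\\[.*\\]", "", x)
print(greedy)
#> [1] ": two."

# Lazy: .*? stops at the first "]" after each "[",
# so each label is removed separately
lazy <- gsub("\\[.*?\\]", "", x)
print(lazy)
#> [1] ": one. : two."
```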

Answer 2: (score: 3)

Here is another approach:

library(qdap)
bracketX(text, "square")

Answer 3: (score: 3)

There is no need to use a PCRE regex with a negated character class / bracket expression; a "classic" TRE regex will work as well:

subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some  here and "

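See the online R demo.

Details:

  • \\[ - a literal [ (it must be escaped, or placed inside a bracket expression such as [[], to be parsed as a literal [)
  • [^][]* - a negated bracket expression that matches zero or more characters other than [ and ] (note that a ] at the start of a bracket expression is treated as a literal ])
  • ] - a literal ] (outside a bracket expression this character is not special in either PCRE or TRE regexps and does not have to be escaped).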
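
The three points above can be checked individually with a short sketch. The inputs and the variable names r1, r2 and r3 are made up for illustration.

```r
# A bare "]" is a literal outside a bracket expression, no escaping needed
r1 <- gsub("]", "", "a]b")
print(r1)
#> [1] "ab"

# "[" can be matched literally by placing it inside a bracket expression
r2 <- gsub("[[]", "", "a[b")
print(r2)
#> [1] "ab"

# [^][] matches any single character that is neither "]" nor "[",
# so removing every match leaves only the brackets behind
r3 <- gsub("[^][]", "", "a[b]c")
print(r3)
#> [1] "[]"
```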

If you only want to replace the square brackets with other delimiters, use a capturing group with a backreference in the replacement pattern:

gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"

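See another demo.

The (...) parenthesized construct forms a capturing group whose contents can be accessed with the backreference \1 (since this group is the first one in the pattern, its ID is 1).

Answer 4: (score: 2)

I think this technically answers what you asked, but you may want to append "\\: " (an escaped colon followed by a space) to the end of the regex to get prettier text, with the colon and the space removed as well.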

library(stringr)
str_replace_all(text, "\\[.+?\\]", "")

#> [1] ": We need tax policies that respect the wage earners..."

vs ...

str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..." 
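
A base R variant of the same cleanup, as a sketch only: it assumes each label may be followed by an optional colon and some whitespace, and it is shown on a made-up string. The names x and clean are hypothetical.

```r
# Hypothetical short input, not the original ad transcript
x <- "[A]: one. [B]: two."

# Remove each bracketed label, an optional colon after it, and any
# whitespace that follows; trimws() then drops leading/trailing spaces
clean <- trimws(gsub("\\[[^][]*\\]:?[[:space:]]*", "", x))
print(clean)
#> [1] "one. two."
```

Making the colon optional means labels without a trailing colon are removed too, which the fixed "\\: " suffix would miss.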

Created on 2018-08-16 by the reprex package (v0.2.0).