Question

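I am trying to find a simple way to extract an unknown substring (it could be anything) that appears between two known substrings. For example, I have a string: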

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

I need to extract the string GET_ME, which sits between STR1 and STR2 (without the surrounding spaces).

I am trying str_extract(a, "STR1 (.+) STR2"), but I get the entire match:

[1] "STR1 GET_ME STR2"

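I can of course strip the known strings to isolate the substring I need, but I think there should be a cleaner way to do it with the right regular expression.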

Answer 1

You can use str_match with STR1 (.*?) STR2 (note that the spaces are "meaningful"; if you just want to match anything between STR1 and STR2, use STR1(.*?)STR2). If you have multiple occurrences, use str_match_all.

library(stringr)
a<-" anything goes here, STR1 GET_ME STR2, anything goes here"
res <- str_match(a, "STR1 (.*?) STR2")
res[,2]
[1] "GET_ME"

Another way to get the first match is to use regexec from base R.
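The code for the regexec variant did not survive in this copy of the answer, so the following is only a minimal sketch of how it could look, not the answerer's original code. It assumes the same pattern as the str_match example; the variable names (m, first_capture, a_multi, all_captures) are illustrative. The second half shows one possible base R way to collect every occurrence, using gregexpr with Perl lookarounds instead of a capture group.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

# regexec returns the position of the full match and of each capture group;
# regmatches turns those positions into the matched strings.
m <- regexec("STR1 (.*?) STR2", a)
parts <- regmatches(a, m)

# Element 1 is the full match, element 2 is the first capture group.
first_capture <- parts[[1]][2]
print(first_capture)
# [1] "GET_ME"

# For several occurrences, lookarounds keep the boundaries out of the match
# (perl = TRUE is required for lookbehind/lookahead).
a_multi <- "STR1 A STR2 and STR1 B STR2"
all_captures <- regmatches(
  a_multi,
  gregexpr("(?<=STR1 ).*?(?= STR2)", a_multi, perl = TRUE)
)[[1]]
print(all_captures)
# [1] "A" "B"
```

regmatches returns a list with one element per input string, which is why the result is indexed with [[1]] before picking out the capture.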

Answer 2

Here is another way, using base R:

a<-" anything goes here, STR1 GET_ME STR2, anything goes here"

gsub(".*STR1 (.+) STR2.*", "\\1", a)

Output:

[1] "GET_ME"
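One thing to keep in mind with the substitution approach: when the pattern does not match, sub and gsub return the input unchanged, which can be mistaken for a result. The sketch below wraps the same idea in a hypothetical helper, extract_between (not part of base R or of the answer), that returns NA in that case. It assumes the boundaries are separated from the target by a single space, as in the question, and that left and right contain no regex metacharacters.

```r
# Hypothetical helper: extract the text between `left` and `right`,
# returning NA when the boundaries are not found.
extract_between <- function(x, left, right) {
  # Lazy quantifiers so the first left/right pair is used.
  pattern <- paste0(".*?", left, " (.*?) ", right, ".*")
  ifelse(
    grepl(pattern, x, perl = TRUE),
    sub(pattern, "\\1", x, perl = TRUE),
    NA_character_
  )
}

a <- " anything goes here, STR1 GET_ME STR2, anything goes here"
extract_between(a, "STR1", "STR2")
# [1] "GET_ME"

extract_between("no boundaries in this string", "STR1", "STR2")
# [1] NA
```

Because grepl, sub and ifelse are vectorised, the helper also accepts a character vector for x.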

Answer 3

Another option is to use qdapRegex::ex_between to extract strings between a left and a right boundary:

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"

It also works with multiple occurrences:

a <- "anything STR1 GET_ME STR2, anything goes here, STR1 again get me STR2"

qdapRegex::ex_between(a, "STR1", "STR2")[[1]]
#[1] "GET_ME"       "again get me"

Or with multiple left and right boundaries:

a <- "anything STR1 GET_ME STR2, anything goes here, STR4 again get me STR5"
qdapRegex::ex_between(a, c("STR1", "STR4"), c("STR2", "STR5"))[[1]]
#[1] "GET_ME"       "again get me"

The first capture is between "STR1" and "STR2", while the second is between "STR4" and "STR5".

Answer 4

We can use {unglue}, in which case we don't need regular expressions at all:

library(unglue)
unglue::unglue_vec(
  " anything goes here, STR1 GET_ME STR2, anything goes here", 
  "{}STR1 {x} STR2{}")
#> [1] "GET_ME"

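{} matches anything without keeping it, and {x} captures its match (any variable name other than x could be used). The syntax "{}STR1 {x} STR2{}" is short for "{=.*?}STR1 {x=.*?} STR2{=.*?}".

If you also want to extract the sides, you can do: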
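To make the expanded form more concrete, here is a rough base R counterpart of that pattern. It is only a sketch: it assumes each {=.*?} placeholder behaves like a lazy .*? and that the whole pattern is anchored to the full string, which may not be exactly how unglue builds its regex internally.

```r
a <- " anything goes here, STR1 GET_ME STR2, anything goes here"

# {=.*?}  -> .*?    (matched but not kept)
# {x=.*?} -> (.*?)  (captured)
# The ^ and $ anchors are an assumption about how the pattern is applied.
x <- sub("^.*?STR1 (.*?) STR2.*?$", "\\1", a, perl = TRUE)
print(x)
# [1] "GET_ME"
```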

unglue::unglue_data(
  " anything goes here, STR1 GET_ME STR2, anything goes here", 
  "{left}, STR1 {x} STR2, {right}")
#>                  left      x              right
#> 1  anything goes here GET_ME anything goes here

Extracting a string between two other strings in R
