Question

I am trying to extract a string between two words. Consider this example:

x <-  "There are 2.3 million species in the world"

It could also take another form:

x <-  "There are 2.3 billion species in the world"

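I need the text between There and either million or billion, including both markers. Whether million or billion is present is decided at run time; it is not known in advance. So the output I expect from these sentences is

[1] There are 2.3 million, or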
[2] There are 2.3 billion

I am using the rm_between function from the qdapRegex package. With this command I can extract only one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)

or I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a command that checks whether either million or billion is present in the same sentence? Something like this:

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.

Answer 1

You can use str_extract_all (all matches) or str_extract (a single match):

library(stringr)
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")

or

str_extract_all(x, perl("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))
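
For readers without stringr installed, the same alternation pattern can be tried in base R. This is only a sketch under the assumption that x holds the two sentences from the question; regexpr() and regmatches() are base R functions, and the names pattern, hits, and extracted are chosen here for illustration.

```r
# Both forms of the sentence from the question
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from "There" up to the first "million" or "billion"
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each string; perl = TRUE enables
# PCRE syntax such as the non-capturing group (?:...)
hits <- regexpr(pattern, x, perl = TRUE)

# regmatches() pulls out the matched substrings
extracted <- regmatches(x, hits)
print(extracted)
# [1] "There are 2.3 million" "There are 2.3 billion"
```

Note that regmatches() silently drops elements with no match, so the result can be shorter than x.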

Answer 2

The left and right arguments of rm_between accept a vector of character or numeric symbols, so you can pass vectors of equal length as left and right.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

Or

  sub('\\s*species.*', '', x)

Data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"
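
The sub() call in this answer works because the word species happens to follow the unit. A variant that does not depend on the trailing text might capture the wanted span and replace the whole string with it. The following is a sketch in base R only; the names pattern, trimmed, and unmatched and the "No units here" input are made up for illustration.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Group 1 captures everything from "There" to the first "million"/"billion";
# the surrounding .*? and .* consume the rest of the string
pattern <- ".*?(\\bThere\\b.*?\\b(?:million|billion)\\b).*"

# Replace the whole string with the captured group
trimmed <- sub(pattern, "\\1", x, perl = TRUE)
print(trimmed)
# [1] "There are 2.3 million" "There are 2.3 billion"

# Caveat: sub() returns its input unchanged when the pattern does not match
unmatched <- sub(pattern, "\\1", "No units here", perl = TRUE)
print(unmatched)
# [1] "No units here"
```

Because of that caveat, this approach suits inputs where a match is guaranteed; otherwise an extraction function is the safer choice.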

Answer 3

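~~With rm_between, you can supply vectors of multiple markers of equal length, as the documentation states.~~

Edit

For the updated arguments of rm_between, see @TylerRinker's answer.

Alternatively, rm_default lets you supply a user-defined regular expression: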

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example:

library(qdapRegex)

x <- c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
##
## [[2]]
## [1] "There are 2.3 billion"
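
The lazy quantifier .*? in that pattern matters once a sentence contains more than one unit word. The sketch below uses base R only, and the sentence y is a hypothetical input that is not part of the question; it compares the lazy and greedy forms of the pattern.

```r
# Hypothetical sentence containing both unit words
y <- "There are 2.3 million species and 7 billion people"

# Lazy .*? stops at the first "million"/"billion" after "There"
lazy <- regmatches(y, regexpr("There.*?[bm]illion", y, perl = TRUE))
print(lazy)
# [1] "There are 2.3 million"

# Greedy .* runs on to the last "million"/"billion" in the string
greedy <- regmatches(y, regexpr("There.*[bm]illion", y, perl = TRUE))
print(greedy)
# [1] "There are 2.3 million species and 7 billion"
```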

Answer 4

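@hwnd's response (he is my colleague and qdapRegex co-author) sparked a discussion that led to a new fixed argument for rm_between. The development version describes it as follows:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed (escaped) by default. This did not allow for powerful use of regular expressions in the left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using logical operators in rm_between function

Install the development version: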

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

With qdapRegex version >= 4.1, you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left='There', right = '[mb]illion', fixed = FALSE,
    include=TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
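
The difference between a fixed (literal) boundary and a regular expression boundary can be seen without qdapRegex. The sketch below uses base R's grepl(), which has its own fixed argument. It is an analogy only, not a description of how rm_between is implemented; the names boundary, literal_hit, and regex_hit are chosen for illustration.

```r
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

boundary <- "[mb]illion"

# Treated as a literal string: the text "[mb]illion" never occurs in x
literal_hit <- grepl(boundary, x, fixed = TRUE)
print(literal_hit)
# [1] FALSE FALSE

# Treated as a regular expression: [mb] matches either "m" or "b"
regex_hit <- grepl(boundary, x)
print(regex_hit)
# [1] TRUE TRUE
```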
