Using the rm_between function

Time: 2015-07-25 04:52:36

Tags: r string qdapregex

I am trying to extract a string between words. Consider this example:

x <-  "There are 2.3 million species in the world"

It could also take another form:

x <-  "There are 2.3 billion species in the world"

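I need the text between There and million/billion, including the markers themselves. Whether million or billion appears is decided at run time; it is not known in advance. So the output I want from these sentences is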

[1] There are 2.3 million
[2] There are 2.3 billion

I am using the rm_between function from the qdapRegex package. With this command I can only extract one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE) 

Or I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a command that checks for either million or billion in the same sentence? Something like this:

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.

4 Answers:

Answer 0: (score: 3)

You can use str_extract_all (global match) or str_extract (single match):

library(stringr)
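# Word-boundary version: lazy match from "There" to the first million/billion
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")

# Whitespace-delimited version using lookarounds; regex() replaces the
# perl() wrapper that older stringr versions used
str_extract_all(x, regex("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))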
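
If stringr is not installed, the same alternation idea can be sketched in base R. This is only an illustration of the pattern above, assuming the two-sentence vector x from the question; it is not part of the original answer.

```r
# Base R only; no packages needed.
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from "There" up to the first "million" or "billion".
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each element;
# regmatches() extracts the matched substrings.
hits <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(hits)
```

Note that regmatches() combined with regexpr() silently drops elements that have no match, so the result can be shorter than x.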

Answer 1: (score: 3)

The left and right arguments of rm_between take a vector of character/numeric symbols. So you can use vectors of equal length for the left/right arguments.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

Or

  sub('\\s*species.*', '', x)

Data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"

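Answer 2: (score: 2)

With rm_between, you can supply vectors of multiple markers of equal length, as the documentation states.

Edit

For the updated arguments of rm_between, see @TylerRinker's answer.

Alternatively, you can use rm_default with a user-defined regular expression: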

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"

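Answer 3: (score: 2)

@hwnd's response (hwnd is my fellow qdapRegex co-author) prompted a discussion that led to a new fixed argument for rm_between. The following description is from the development version: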

  

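rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed (escaped) by default. This did not allow for the powerful use of a regular expression for left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using logical operators in rm_between function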
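
To make the difference between fixed and regex boundaries concrete, here is a minimal base-R sketch. The helpers escape_regex and extract_between are hypothetical names made up for this illustration; they are not qdapRegex functions and do not reproduce its actual implementation.

```r
# Hypothetical helper: escape regex metacharacters so a marker is
# matched literally (roughly what "fixed" boundaries imply).
escape_regex <- function(s) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", s, perl = TRUE)
}

# Hypothetical helper: extract the text from `left` to `right`, inclusive.
extract_between <- function(x, left, right, fixed = TRUE) {
  if (fixed) {
    left  <- escape_regex(left)
    right <- escape_regex(right)
  }
  pattern <- paste0(left, ".*?", right)
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # fill in only the elements that matched
  out
}

x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Boundary treated as a regex: [mb] matches either letter.
regex_result <- extract_between(x, "There", "[mb]illion", fixed = FALSE)

# Boundary treated literally: the text "[mb]illion" never occurs, so no match.
fixed_result <- extract_between(x, "There", "[mb]illion", fixed = TRUE)

print(regex_result)
print(fixed_result)
```

With escaping switched on, the character class is searched for as literal text, which is why a regex boundary only works once escaping is turned off.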

Install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

With qdapRegex version >= 4.1, you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
    include.markers = TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"