I am trying to extract a string between two words. Consider this example:
x <- "There are 2.3 million species in the world"
It may also take another form:
x <- "There are 2.3 billion species in the world"
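I need the text between There and either million or billion, including those markers. Whether million or billion is present is only known at run time, not in advance. So the output I want from this sentence is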
[1] There are 2.3 million
or
[2] There are 2.3 billion
I am using the rm_between function from the qdapRegex package. With this command I can only extract one of them at a time.
library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)
or I have to use
rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)
How can I write a single command that checks whether million or billion is present in the same sentence? Something like this:
rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)
I hope this is clear. Any help would be appreciated.
Answer 0 (score: 3)
You can use str_extract_all (global match) or str_extract (single match):
library(stringr)
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")
or
str_extract_all(x, "(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)")
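If stringr is not available, the same alternation pattern can be tried with base R. This is only a minimal sketch, assuming the input vector x from the question; the names pattern, m and result are illustrative and not part of the original answer.

```r
# Sample input from the question
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from "There" up to the first whole word "million" or "billion"
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each element; perl = TRUE selects
# the PCRE engine, which supports the lazy quantifier and (?: ) group
m <- regexpr(pattern, x, perl = TRUE)

# regmatches() returns the matched substrings; elements without a match
# are dropped from the result
result <- regmatches(x, m)
print(result)
```

Because regmatches drops non-matching elements, the result can be shorter than x when some sentences contain neither word.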
Answer 1 (score: 3)
The left and right arguments of rm_between take a vector of character/numeric symbols. So you can pass vectors of equal length to the left and right arguments.
library(qdapRegex)
unlist(rm_between(x, rep('There',2), c('million', 'billion'),
extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 million" "There are 2.3 billion"
unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 million"
unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
extract=TRUE, include.markers=TRUE))
#[1] "There are 2.3 billion"
Or
sub('\\s*species.*', '', x)
Data used in the examples above:
x <- c("There are 2.3 million species in the world",
"There are 2.3 billion species in the world")
x1 <- "There are 2.3 million species in the world"
x2 <- "There are 2.3 billion species in the world"
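The sub call above depends on the word species following the number. A variant that keys on the markers themselves might look like the sketch below; it is an assumption-laden illustration in base R, not part of the original answer, and the name out is hypothetical.

```r
# Sample input from the question
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Capture everything from "There" up to the first "million" or "billion",
# then replace the whole string with just the captured group
out <- sub("^.*?(There.*?(?:million|billion)).*$", "\\1", x, perl = TRUE)
print(out)
```

Note that sub returns a string unchanged when the pattern does not match, so inputs without either marker would pass through as they are.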
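Answer 2 (score: 2)
With rm_between, you can supply vectors of multiple markers of equal length, as the documentation states.
For the updated arguments of rm_between, see @TylerRinker's answer.
An alternative that lets you supply your own regular expression is rm_default: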
rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)
Example:
library(qdapRegex)
x <- c(
'There are 2.3 million species in the world',
'There are 2.3 billion species in the world'
)
rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)
## [[1]]
## [1] "There are 2.3 million"
## [[2]]
## [1] "There are 2.3 billion"
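The character class [bm]illion matches million or billion anywhere, including inside longer words. The sketch below, using base R grepl on a made-up input vector, compares the pattern above with a hypothetical stricter variant that adds word boundaries; the names loose and strict are illustrative only.

```r
# Made-up inputs: the last two are not in the question and are only
# here to show where the two patterns differ
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world",
       "There are 2.3 trillion cells",
       "There are many millionaires")

loose  <- "There.*?[bm]illion"             # pattern from the answer
strict <- "\\bThere\\b.*?\\b[bm]illion\\b" # assumed stricter variant

# TRUE where the pattern matches somewhere in the string
loose_hit  <- grepl(loose, x, perl = TRUE)
strict_hit <- grepl(strict, x, perl = TRUE)

print(data.frame(x, loose_hit, strict_hit))
```

Whether the stricter form is preferable depends on the data; for the two sentences in the question both patterns behave the same.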
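Answer 3 (score: 2)
The response from @hwnd (my fellow qdapRegex co-author) sparked a discussion that led to a new fixed argument for rm_between. The following description is from the development version: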
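rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed (escaped) by default. This did not allow for powerful use of a regular expression for the left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regular expression boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using logical operators in rm_between function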
To install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")
With qdapRegex version >= 0.4.1, you can do the following.
x <- c(
"There are 2.3 million species in the world",
"There are 2.3 billion species in the world"
)
rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
include.markers = TRUE, extract = TRUE)
## [[1]]
## [1] "There are 2.3 million"
##
## [[2]]
## [1] "There are 2.3 billion"
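The difference between a fixed (literal) boundary and a regular expression boundary can be illustrated with base R's grepl. This is only an analogy for what fixed = TRUE and fixed = FALSE mean, not a reproduction of qdapRegex internals; the names literal_hit and regex_hit are illustrative.

```r
# Sample input from the question
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Treated literally, "[mb]illion" is searched for as the exact text
# "[mb]illion", brackets included, which neither sentence contains
literal_hit <- grepl("[mb]illion", x, fixed = TRUE)

# Treated as a regular expression, [mb] is a character class,
# so both "million" and "billion" match
regex_hit <- grepl("[mb]illion", x)

print(literal_hit)
print(regex_hit)
```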