Question

I have a string like this:

the.string <- "982y987r0jhABCdioy2093uiwhf"

I also have a vector of substrings like this:

the.substrings <- c("ABC", "DEF", "GHI", "987")

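I want to form a new vector that contains only the first 2 occurrences of the.substrings in the.string, in order of appearance. So in the example above, we want only "987" and "ABC", in that order.

I have implemented this with the following algorithm:

1. Loop over each of the.substrings and search for an occurrence of it in the.string.
2. If it occurs, save the substring and the position where it occurs.
3. On exiting the loop, sort the occurrences using the positions saved in step 2.

This seems to work fine: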

mod.str <- list(2)
pos.str <- numeric(2)
n <- 1

for (i in 1:length(the.substrings)) {
  reg.search <- gregexpr(the.substrings[i], the.string)
  if(reg.search[[1]][1] > 0) {
    mod.str[n] <- the.substrings[i]
    pos.str[n] <- reg.search[[1]][1]
    n <- n + 1
  }
}

dtfoo <- as.data.frame(cbind(mod.str, pos.str))
dtfoo <- as.data.frame(lapply(dtfoo, unlist))

as.character(dtfoo[order(dtfoo$pos.str),][, 1])

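But I would like to know whether there is a better way (more efficient, less error-prone, perhaps using a functional programming approach) to achieve this?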

Answer 1

You can use functions from stringr like this:

library(stringr)

First, extract the locations of the matching strings:

string.locations <- str_locate(the.string, the.substrings)
string.locations 
#      start end
# [1,]    12  14
# [2,]    NA  NA
# [3,]    NA  NA
# [4,]     5   7

Sort them by start position and keep only the first two:

string.locations <- string.locations[order(string.locations[, 1]), ]
string.locations.sub <- string.locations[1:2, ]
string.locations.sub 
#      start end
# [1,]     5   7
# [2,]    12  14

Then subset the original string at just those positions:

str_sub(the.string, string.locations.sub)
# [1] "987" "ABC"
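
The same locate, sort, subset idea can also be sketched in base R. This is only an illustration, not part of the answer above: the names first.pos, found and result are made up for the example, and fixed = TRUE assumes the substrings should be matched literally rather than as regular expressions. Substrings that never occur are dropped before sorting, so no NA entries appear when fewer than two are found.

```r
the.string <- "982y987r0jhABCdioy2093uiwhf"
the.substrings <- c("ABC", "DEF", "GHI", "987")

# Position of the first occurrence of each substring (-1 when absent).
# vapply() names the result after the substrings because the input is character.
first.pos <- vapply(
  the.substrings,
  function(s) as.integer(regexpr(s, the.string, fixed = TRUE)),
  integer(1)
)

# Keep only the substrings that occur, order them by position,
# and take at most the first two.
found <- first.pos[first.pos > 0]
result <- head(names(found)[order(found)], 2)
result
# [1] "987" "ABC"
```

Because each substring contributes only its first position, a substring that occurs twice is reported once, which matches the loop in the question.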

Answer 2

You can use the following base R solution:

regmatches(the.string, gregexpr(paste(the.substrings, collapse="|"), the.string))

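The point is that you use the.substrings to build a regular expression whose alternatives are separated by the | alternation operator, and regmatches / gregexpr will extract all occurrences of the pattern in the input, in left-to-right order.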
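
To make the individual steps visible, the one-liner can be unpacked roughly as follows. The intermediate names pattern, m, found and starts are introduced here only for illustration.

```r
the.string <- "982y987r0jhABCdioy2093uiwhf"
the.substrings <- c("ABC", "DEF", "GHI", "987")

# Join the substrings into one alternation pattern.
pattern <- paste(the.substrings, collapse = "|")
pattern
# [1] "ABC|DEF|GHI|987"

# gregexpr() returns a list with one element per input string;
# that element holds the start positions of all matches.
m <- gregexpr(pattern, the.string)
starts <- as.integer(m[[1]])

# regmatches() turns those positions back into the matched text.
found <- regmatches(the.string, m)[[1]]
found
# [1] "987" "ABC"
```

The start positions come back in increasing order, which is why the extracted matches are already sorted by where they appear in the string. Unlike a per-substring search, this returns every occurrence, so a substring that appears twice is listed twice.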

If the.substrings were c("ABC", "DEF", "GHI", "987", "ABCDE"), the pattern would look like ABC|DEF|GHI|987|ABCDE. Since the regex engine used in this gregexpr call is TRE, the alternation is matched in the way the following passage describes:

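When the text-directed engine attempts Get|GetValue|Set|SetValue on SetValue, it tries all permutations of the regex at the start of the string. It does so efficiently, without any backtracking. It sees that the regex can find a match at the start of the string, and that the matched text can be either Set or SetValue. Because the text-directed engine evaluates the regex as a whole, it has no concept of one alternative being listed before another. But it has to make a choice as to which match to return. It always returns the longest match, in this case SetValue.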

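If you use the same approach with stringr::str_extract_all (together with paste(the.substrings, collapse="|")), you may get a different set of matches, because the ICU regex engine checks the alternatives in the way described in the Remember That The Regex Engine Is Eager section. The point is that once a matching alternative is found, the remaining ones (to its right) are not even tried. You can easily see this if you run the following code:

> the.string <- "ABCDE982y987r0jhABCdioy2093uiwhf"
> the.substrings <- c("ABC", "DEF", "GHI", "987", "ABCDE")
> str_extract_all(the.string, str_c(the.substrings, collapse = "|"))
[[1]]
[1] "ABC" "987" "ABC"

> regmatches(the.string, gregexpr(paste(the.substrings, collapse="|"), the.string))
[[1]]
[1] "ABCDE" "987"   "ABC"

Since ABC comes before ABCDE in the pattern, stringr::str_extract_all returns the ABC match (the following DE matches none of the alternatives, so it is skipped), whereas gregexpr checks all possible matches and returns the longest one, ABCDE.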
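
The effect of alternative order can also be reproduced with base R alone. The sketch below is an assumption-laden illustration rather than part of the answer: it passes perl = TRUE to gregexpr to switch to PCRE, used here as a stand-in for an eager engine on the assumption that it treats this pattern the same way ICU does, and it introduces a hypothetical helper extract(). Sorting the alternatives longest first is one possible way to get the longest match from an eager engine.

```r
the.string <- "ABCDE982y987r0jhABCdioy2093uiwhf"
the.substrings <- c("ABC", "DEF", "GHI", "987", "ABCDE")

# Hypothetical helper: build the alternation and extract all matches.
# Extra arguments (such as perl = TRUE) are passed on to gregexpr().
extract <- function(subs, ...) {
  pattern <- paste(subs, collapse = "|")
  regmatches(the.string, gregexpr(pattern, the.string, ...))[[1]]
}

# Default engine (TRE): longest alternative wins at each position.
tre.matches <- extract(the.substrings)

# PCRE: the first alternative that matches wins, so ABC shadows ABCDE.
pcre.matches <- extract(the.substrings, perl = TRUE)
pcre.matches
# [1] "ABC" "987" "ABC"

# Listing longer alternatives first lets the eager engine find ABCDE.
longest.first <- the.substrings[order(nchar(the.substrings), decreasing = TRUE)]
pcre.sorted <- extract(longest.first, perl = TRUE)
pcre.sorted
# [1] "ABCDE" "987"   "ABC"
```

Whether the shorter or the longer match is the desired one depends on the task; the sketch only shows that the choice is controlled by the engine and, for eager engines, by the order of the alternatives.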

See also the Text-Directed Engine Returns the Longest Match section at regular-expressions.info.

Answer 3

Also using stringr:

library(stringr)
str_extract_all(the.string, str_c(the.substrings, collapse = "|"))[[1]][1:2]
[1] "987" "ABC"
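
One caveat with the [1:2] indexing is worth spelling out: if fewer than two substrings are found, indexing past the end of the vector pads the result with NA. The sketch below uses a hand-written matches vector (an assumption standing in for the extraction result) to show the difference from head(), which simply returns what is there.

```r
# Stand-in for an extraction result in which only one substring was found.
matches <- c("987")

# Indexing beyond the end of the vector yields NA for the missing element.
padded <- matches[1:2]
padded
# [1] "987" NA

# head() returns at most two elements and never pads.
trimmed <- head(matches, 2)
trimmed
# [1] "987"
```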

Extract substrings in order
