Imagine I have a set of strings, for example:
#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"
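Now I want to check whether certain patterns occur in the first, middle, or last third of a sequence. So I would like to be able to formulate rules like this: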
Match the string if, and only if,
marker X occurs in the first/middle/last third of the string
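For example, I might want to match strings that have an A in the first third. Given the sequences above, this would match #1 and #2. I might also want to match strings that have an A in the last third. That would match #2 and #3.
How can I write generic code or a regular expression pattern that takes such a rule as input and then matches the corresponding subsequences?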
Answer 0 (score: 5)
Here is a fully vectorized attempt (you can play with the settings and tell me if you would like anything added or changed):
StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE) {
  xsub <- gsub("-", "", x, fixed = TRUE)   # drop the separators
  sizes <- nchar(xsub) / frac              # segment width, possibly fractional
  # substr() truncates fractional positions to integers
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)
  hits <- grep(pat, subs, fixed = fixed)   # indices of strings whose segment matches
  if (isTRUE(values)) x[hits] else hits
}
Testing on your vector:
x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3
Or, if you want the actual matched strings:
StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C" "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"
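Note that when the string length (with the dashes removed) is not exactly divisible by 3 (e.g. nchar(x) == 10), the last segment is the largest one by default (e.g. it has size 4 when nchar(x) == 10).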
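To make the note above concrete, here is a minimal sketch of the segment boundaries that the substr() call ends up using. The helper name segment_bounds is hypothetical and not part of the answer; it only assumes that substr() coerces fractional positions to integers by truncation.

```r
# Hypothetical helper (not part of the answer): show the substr() bounds
# used for a string of n characters split into frac segments.
segment_bounds <- function(n, frac = 3L) {
  size <- n / frac
  seg <- seq_len(frac)
  # substr() coerces start/stop to integer, which truncates the fraction
  first <- as.integer(size * (seg - 1L) + 1L)
  last <- as.integer(size * seg)
  data.frame(seg = seg, first = first, last = last, width = last - first + 1L)
}

# For 10 characters the segments cover 1-3, 4-6 and 7-10 (widths 3, 3, 4)
segment_bounds(10L)
```

If a different split is preferred (for example, the first segment being the largest), the rounding of the two bounds is the place to change it.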
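Answer 1 (score: 2)
Here is a solution that generates a regular expression to meet the requirement. Note that regular expressions can count, but they cannot count relative to the total length of the string. So this generates a custom regular expression for each input string, based on its length. I used stringi::stri_detect_regex rather than grep because the latter is not vectorized over the pattern argument. I also assume that the pattern argument is itself a valid regular expression and that any characters that need escaping (e.g. [, .) have already been escaped.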
library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  # number of characters that must precede / follow the fraction of interest
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}
(patterns <- get_regex_fn_fractions(strings, "A", 1))
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"
# Test the regexes:
stri_detect_regex(strings, patterns)
#[1] TRUE TRUE FALSE
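If stringi is not available, one possible workaround for grepl not being vectorized over the pattern is to pair each string with its own pattern through mapply. The following is only a sketch under that assumption: make_patterns repeats the pattern construction from the answer so the block is self-contained, and detect_in_fraction is a hypothetical wrapper name.

```r
# Same pattern construction as in the answer, repeated so this runs on its own.
make_patterns <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}

# Hypothetical wrapper: test the i-th string against the i-th pattern (base R).
detect_in_fraction <- function(strings, pattern, which_fraction, n_groups = 3) {
  patterns <- make_patterns(strings, pattern, which_fraction, n_groups)
  unname(mapply(grepl, patterns, strings, MoreArgs = list(perl = TRUE)))
}

strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
detect_in_fraction(strings, "A", 1)  # "A" in the first third
detect_in_fraction(strings, "A", 3)  # "A" in the last third
```

As in the answer, the positions here are counted on the raw strings, dashes included.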
Answer 2 (score: 1)
Here is one option:
f <- function(txts, needle, operator, threshold) {
  require(stringi)
  txts <- gsub("-", "", txts, fixed = TRUE)        # delete '-'s
  matches <- stri_locate_all_fixed(txts, needle)   # find matches
  ends <- lapply(matches, function(x) x[, "end"])  # end positions of matches (= start for a one-character needle)
  # divide by string length + 1; SIMPLIFY = FALSE keeps one vector per string
  ends <- mapply("/", ends, nchar(txts) + 1, SIMPLIFY = FALSE)
  # indices of strings with at least one match that satisfies operator and threshold
  which(sapply(mapply(operator, ends, threshold, SIMPLIFY = FALSE), any))
}
txts <- c("A-A-B-B-C-C", "A-A-A-A-A-A", "B-B-B-C-A-A")
idx <- f(txts, needle = "A", operator = "<=", threshold = .333)
txts[idx]
# [1] "A-A-B-B-C-C" "A-A-A-A-A-A"
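The same relative-position idea can express the "last third" rule by swapping the operator and threshold. Below is a base-R sketch of that, using gregexpr in place of stringi; the function name rel_pos_match is hypothetical, and it keeps the answer's convention of dividing by the string length plus one.

```r
# Hypothetical base-R variant: relative match positions via gregexpr().
rel_pos_match <- function(txts, needle, operator, threshold) {
  clean <- gsub("-", "", txts, fixed = TRUE)     # delete '-'s
  hits <- gregexpr(needle, clean, fixed = TRUE)  # match starts, -1 if no match
  op <- match.fun(operator)
  keep <- mapply(function(pos, n) {
    pos <- pos[pos > 0]                          # drop the "no match" marker
    any(op(pos / (n + 1), threshold))
  }, hits, nchar(clean))
  which(keep)
}

txts <- c("A-A-B-B-C-C", "A-A-A-A-A-A", "B-B-B-C-A-A")
txts[rel_pos_match(txts, "A", "<=", 1/3)]  # "A" in the first third
txts[rel_pos_match(txts, "A", ">=", 2/3)]  # "A" in the last third
```

A middle-third rule would need two conditions on the same position (above 1/3 and below 2/3), so it does not fit a single operator and threshold without extending the interface.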