Question

Imagine I have a set of strings, for example:

#1: "A-B-B-C-C"
#2: "A-A-A-A-A-A-A"
#3: "B-B-B-C-A-A"

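Now I want to check whether certain patterns occur in the first, middle, or last third of a sequence. So I would like to be able to formulate rules such as: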

Match the string if, and only if, 
marker X occurs in the first/middle/last third of the string

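For example, I might want to match strings that have an A in the first third. Given the sequences above, that would match #1 and #2. I might also want to match strings that have an A in the last third, which would match #2 and #3.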

How can I write generic code or a regex pattern that takes this kind of rule as input and then matches the corresponding subsequences?

Answer 1

Here is an attempt at a fully vectorized solution (play with the settings and tell me if you would like anything added or changed):

StriDetect <- function(x, seg = 1L, pat = "A", frac = 3L, fixed = TRUE, values = FALSE){
  xsub <- gsub("-", "", x, fixed = TRUE)
  sizes <- nchar(xsub) / frac
  subs <- substr(xsub, sizes * (seg - 1L) + 1L, sizes * seg)
  if(isTRUE(values)) x[grep(pat, subs, fixed = fixed)] else grep(pat, subs, fixed = fixed)
}

Testing on your vector:

x <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
StriDetect(x, 1L, "A")
## [1] 1 2
StriDetect(x, 3L, "A")
## [1] 2 3

Or, if you want the actual matched strings:

StriDetect(x, 1L, "A", values = TRUE)
## [1] "A-B-B-C-C"     "A-A-A-A-A-A-A"
StriDetect(x, 3L, "A", values = TRUE)
## [1] "A-A-A-A-A-A-A" "B-B-B-C-A-A"

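Note that when the string length (after the hyphens are removed) is not exactly divisible by 3, e.g. nchar(x) == 10, the last group is by default the largest one (size 4 when nchar(x) == 10). This happens because substr truncates the fractional start and end positions.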
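To make the note on uneven lengths concrete, here is a minimal sketch that reproduces the position arithmetic from StriDetect and reports the resulting segment widths. The helper name seg_bounds is hypothetical (it is not part of the answer), and as.integer is used on the assumption that it mirrors the truncation substr applies to its start and stop arguments.

```r
# Hypothetical helper: the start/end positions that substr() would use for
# each segment, following the same arithmetic as StriDetect.
seg_bounds <- function(n, frac = 3L) {
  size <- n / frac                      # fractional segment size, e.g. 10 / 3
  seg <- seq_len(frac)
  data.frame(
    seg   = seg,
    start = as.integer(size * (seg - 1L) + 1L),  # truncated toward zero
    end   = as.integer(size * seg)
  )
}

b <- seg_bounds(10L)
widths <- b$end - b$start + 1L
widths
## [1] 3 3 4
```

For a string of 10 characters the segments cover positions 1 to 3, 4 to 6, and 7 to 10, so the remainder ends up in the last segment.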

Answer 2

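Here is a solution that generates regular expressions to meet the requirement. Note that regular expressions can count, but they cannot count relative to the total string length, so this generates a custom regex for each input string based on its length. I used stringi::stri_detect_regex rather than grep because the latter is not vectorized over the pattern argument. I also assume that the pattern argument is itself a valid regex and that any characters needing escaping (e.g. [ or .) have already been escaped.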

library("stringi")
strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
get_regex_fn_fractions <- function(strings, pattern, which_fraction, n_groups = 3) {
  before <- round(nchar(strings) / n_groups * (which_fraction - 1))
  after <- round(nchar(strings) / n_groups * (n_groups - which_fraction))
  sprintf("^.{%d}.*%s.*.{%d}$", before, pattern, after)
}
(patterns <- get_regex_fn_fractions(strings, "A", 1))
#[1] "^.{0}.*A.*.{6}$" "^.{0}.*A.*.{9}$" "^.{0}.*A.*.{7}$"

# Test the regexes:
stri_detect_regex(strings, patterns)
#[1]  TRUE  TRUE FALSE
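Two points from this answer can be sketched in base R: escaping a literal marker before it is interpolated into the generated regex, and matching each string against its own pattern. The helper escape_regex is hypothetical (it is not part of the answer), and mapply with grepl is only one possible stand-in for stri_detect_regex. The example checks the middle third, so one third of the characters (hyphens included, as in the answer) is skipped on each side.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that a marker
# such as "." or "[" is matched literally.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

strings <- c("A-B-B-C-C", "A-A-A-A-A-A-A", "B-B-B-C-A-A")
n <- nchar(strings)

# Middle third: require round(n / 3) characters before and after the marker.
patterns <- sprintf("^.{%d}.*%s.*.{%d}$",
                    round(n / 3), escape_regex("A"), round(n / 3))

# grepl() is not vectorized over its pattern, so pair patterns and strings.
hits <- mapply(grepl, patterns, strings, USE.NAMES = FALSE)
hits
## [1] FALSE  TRUE FALSE
```

Only the second string has an A that is both preceded and followed by at least a third of the string.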

Answer 3

Here is one option:

f <- function(txts, needle, operator, threshold) {
  require(stringi)
  txts <- gsub("-", "", txts, fixed = TRUE)                     # delete the '-' separators
  matches <- stri_locate_all_fixed(txts, needle)                # locate every match of needle
  ends <- lapply(matches, function(x) x[, "end"])               # end position of each match (= start for a one-character needle)
  ends <- mapply("/", ends, nchar(txts) + 1, SIMPLIFY = FALSE)  # relative position: divide by string length + 1
  hits <- mapply(operator, ends, threshold, SIMPLIFY = FALSE)   # compare each relative position with the threshold
  which(vapply(hits, any, logical(1)))                          # indices of strings with at least one qualifying match
}
txts <- c("A-A-B-B-C-C", "A-A-A-A-A-A", "B-B-B-C-A-A")
idx <- f(txts, needle = "A", operator = "<=", threshold = .333)
txts[idx]
# [1] "A-A-B-B-C-C" "A-A-A-A-A-A"
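The same relative-position idea can be used for the last third by flipping the comparison. Below is a minimal base R sketch of that variation. The name rel_positions is hypothetical, gregexpr is swapped in for stri_locate_all_fixed, and the divisor of string length + 1 is kept only to mirror the function above.

```r
# Hypothetical helper: relative end position of every match of needle in each
# string, after the '-' separators are removed.
rel_positions <- function(txts, needle) {
  stripped <- gsub("-", "", txts, fixed = TRUE)
  pos <- gregexpr(needle, stripped, fixed = TRUE)
  Map(function(p, n) {
    if (p[1] == -1L) return(numeric(0))              # no match in this string
    ends <- as.integer(p) + attr(p, "match.length") - 1L
    ends / (n + 1)                                   # same divisor as f() above
  }, pos, nchar(stripped))
}

txts <- c("A-A-B-B-C-C", "A-A-A-A-A-A", "B-B-B-C-A-A")
rel <- rel_positions(txts, "A")

# Last third: at least one match lies beyond two thirds of the string.
last_third <- which(vapply(rel, function(r) any(r > 2 / 3), logical(1)))
txts[last_third]
## [1] "A-A-A-A-A-A" "B-B-B-C-A-A"
```

Strings without any match yield an empty position vector, so any() returns FALSE for them and they are never selected.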

Identifying strings based on where a substring occurs within the string
