Split a string knowing some of its substrings

Asked: 2019-04-11 18:00:42

Tags: r regex

Say I have the following string and vector of substrings:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")

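I would like to split this string by extracting the substrings from the vector and building new substrings from the characters in between, so that I get the following: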

res <- c("abc", "[[", "+", "de.f", "[", "-", "[[", "g")

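If matches conflict, the longer one wins (here [[ beats [), and you can assume that no two conflicting matches have the same length.

Tagged with regex, but any solution is welcome; the faster the better.

Please make no assumptions about the kinds of characters used in these strings, other than that they are ASCII. If I have not mentioned a pattern explicitly, it cannot be inferred.


Another example: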

x <- "a*bc[[+de.f[-[[g[*+-h-+"
v <- c("+", "-", "[", "[[", "[*", "+-")
res <- c("a*bc", "[[", "+", "de.f", "[", "-", "[[", "g", "[*", "+-", "h", "-", "+")

4 Answers:

Answer 0: (score: 2)

Using stringr::str_match_all and Hmisc::escapeRegex:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")
tmp <- v[order(-nchar(v))] # sort to have longer first, to match in priority
tmp <- Hmisc::escapeRegex(tmp)
tmp <- paste(tmp,collapse="|")  # compile a match string
pattern <- paste0(tmp,"|(.+?)") # add a pattern to match the rest
# extract all matches into a matrix
mat <- stringr::str_match_all(x, pattern)[[1]]
# aggregate where second column is NA
res <- unname(tapply(mat[,1], 
                     cumsum(is.na(mat[,2])) + c(0,cumsum(abs(diff(is.na(mat[,2]))))),
                     paste, collapse=""))
res
#> [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"

Answer 1: (score: 2)

This seems more like a lexing problem than a matching problem. The minilexer package seems to give good results:
library(minilexer) #devtools::install_github("coolbutuseless/minilexer")

patterns <- c(
  dbracket  = "\\[\\[", 
  bracket   = "\\[",
  plus      = "\\+",
  minus     = "\\-",
  name      = "[a-z.]+"
)

x <- "abc[[+de.f[-[[g"
lex(x, patterns)
unname(lex(x, patterns))
# [1] "abc"  "[["   "+"    "de.f" "["    "-"   
# [7] "[["   "g" 
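
The lexing idea can also be sketched in base R without a package and without hard-coding a character class for the "other" text. The block below is a minimal, unoptimized sketch: `split_tokens` is a hypothetical helper name, not part of minilexer. It scans left to right, tries the longest known token first at each position, and buffers everything else.

```r
# Hypothetical helper (not from minilexer): longest-match-first scanner.
split_tokens <- function(x, v) {
  v <- v[order(-nchar(v))]   # try longer tokens before shorter ones
  out <- character(0)
  buf <- ""                  # characters collected since the last token
  i <- 1
  n <- nchar(x)
  while (i <= n) {
    hit <- ""
    for (tok in v) {
      # compare the token with the same-length slice of x starting at i
      if (substr(x, i, i + nchar(tok) - 1) == tok) {
        hit <- tok
        break
      }
    }
    if (nzchar(hit)) {
      if (nzchar(buf)) out <- c(out, buf)   # flush the in-between text
      out <- c(out, hit)
      buf <- ""
      i <- i + nchar(hit)
    } else {
      buf <- paste0(buf, substr(x, i, i))
      i <- i + 1
    }
  }
  if (nzchar(buf)) out <- c(out, buf)       # flush any trailing text
  out
}

x <- "a*bc[[+de.f[-[[g[*+-h-+"
v <- c("+", "-", "[", "[[", "[*", "+-")
split_tokens(x, v)
```

Because tokens are compared as literal strings, nothing needs regex escaping here; the trade-off is an R-level loop, which will likely be slower than a compiled regex on long inputs.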

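Answer 2: (score: 1)

One option to get the matches could be to use an alternation:

[a-z.]+|\[+|[+-]
  • [a-z.]+ matches a character in a-z or a dot, 1+ times
  • | or
  • \[+ matches [ 1+ times
  • | or
  • [+-] matches + or -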


For example, to get the matches:

library(stringr)
x <- "abc[[+de.f[-[[g"
str_extract_all(x, "[a-z.]+|\\[+|[+-]")
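
For readers without stringr, the same pattern can be tried with base R's `regmatches` and `gregexpr`. This is only a sketch for this particular input: the pattern hard-codes `[a-z.]` for the text between tokens, so it is not built from an arbitrary `v`.

```r
x <- "abc[[+de.f[-[[g"

# gregexpr finds every match of the alternation; regmatches extracts them
res <- regmatches(x, gregexpr("[a-z.]+|\\[+|[+-]", x))[[1]]
res
#> [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"
```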

Answer 3: (score: 1)

A purely regex-based solution would look like this:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")

## Escaping function
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]

pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
pat <- paste0("(?s)", pat, "|(?:(?!", pat, ").)+")
res <- regmatches(x, gregexpr(pat, x, perl=TRUE))
## => [[1]]
##    [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"

The PCRE regex built here is:

(?s)\[\[|\+|-|\[|(?:(?!\[\[|\+|-|\[).)+


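Details

  • (?s) - a DOTALL modifier that makes . match any character, including line breaks
  • \[\[ - a [[ substring (escaped with regex.escape)
  • | - or
  • \+ - a +
  • |- - or - (there is no need to escape -, since it is not inside a character class)
  • |\[ - or [
  • | - or
  • (?:(?!\[\[|\+|-|\[).)+ - a tempered greedy token that matches any character (.), one or more times and as many as possible (the + at the end), as long as that character does not start a [[, +, - or [ sequence.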
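
To see the tempered greedy token at work on more than single-character clashes, the sketch below applies the same construction to the second example from the question. The names `esc`, `x2`, `v2`, `alt`, `pat2` and `res2` are introduced here for illustration only; `esc` is a compact escaping helper in the spirit of regex.escape.

```r
# Hypothetical compact escaping helper: backslash-escape regex metacharacters
esc <- function(s) gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", s)

x2 <- "a*bc[[+de.f[-[[g[*+-h-+"
v2 <- c("+", "-", "[", "[[", "[*", "+-")

# Alternation of the known tokens, longest first so that "[[" beats "["
alt <- paste(esc(v2[order(-nchar(v2))]), collapse = "|")

# Either a known token, or a run of characters none of which starts a token
pat2 <- paste0("(?s)", alt, "|(?:(?!", alt, ").)+")

res2 <- regmatches(x2, gregexpr(pat2, x2, perl = TRUE))[[1]]
expected <- c("a*bc", "[[", "+", "de.f", "[", "-", "[[", "g",
              "[*", "+-", "h", "-", "+")
identical(res2, expected)
#> [1] TRUE
```

The lone * in "a*bc" stays inside the text run because no token starts with *, even though * appears inside the token [*.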

You can also consider a less "regex-intensive" solution that uses the TRE regex engine:

x <- "abc[[+de.f[-[[g"
v <- c("+", "-", "[", "[[")

## Escaping function
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
## Sorting by length in the descending order function
sort.by.length.desc <- function (v) v[order( -nchar(v)) ]
## Interleaving function
riffle3 <- function(a, b) { 
  mlab <- min(length(a), length(b)) 
  seqmlab <- seq(length=mlab) 
  c(rbind(a[seqmlab], b[seqmlab]), a[-seqmlab], b[-seqmlab]) 
} 
pat <- paste(regex.escape(sort.by.length.desc(v)), collapse="|")
res <- riffle3(regmatches(x, gregexpr(pat, x), invert=TRUE)[[1]], regmatches(x, gregexpr(pat, x))[[1]])
res <- res[res != ""]
## => [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"   


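So, the search terms are escaped for use in a regex and sorted by length in descending order, an alternation-based pattern is built dynamically, all matching and non-matching substrings are found and joined into a single character vector, and finally the empty items are discarded.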
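
These steps can be condensed into one function. The following is a sketch under stated assumptions rather than a drop-in replacement: `split_by_tokens` and its inner `esc` helper are hypothetical names, and the interleaving uses `rbind` directly, relying on `regmatches(..., invert = TRUE)` returning exactly one more piece than there are matches.

```r
# Hypothetical wrapper around the escape / sort / match / interleave steps
split_by_tokens <- function(x, v) {
  esc <- function(s) gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", s)
  pat <- paste(esc(v[order(-nchar(v))]), collapse = "|")  # longest first
  m <- gregexpr(pat, x)
  hits <- regmatches(x, m)[[1]]                  # the known tokens
  gaps <- regmatches(x, m, invert = TRUE)[[1]]   # the text around them
  # pad hits with "" so both vectors have equal length, then interleave
  out <- c(rbind(gaps, c(hits, "")))
  out[nzchar(out)]                               # drop empty pieces
}

split_by_tokens("abc[[+de.f[-[[g", c("+", "-", "[", "[["))
#> [1] "abc"  "[["   "+"    "de.f" "["    "-"    "[["   "g"
```

Dropping empty strings at the end also removes the padding element, so no separate handling of leftover items is needed.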