R regex - splitting between parentheses

Time: 2015-08-18 17:46:39

Tags: regex r

Suppose I have a string x, and I want to split it like this:

x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"

# Desired output
[1]  "(A|C|T)"  "A"  "G"  "(C|T)"  "(A|C|G|T)"  "(A|C|G|T)"  "(A|C|G|T)"  
[8]  "(A|C|G|T)"  "(A|C|G|T)"  "G"  "C"  "C"  "(C|T)"  "(A|C|G|T)"  
[15] "(A|C|G|T)"  "(A|C|G)"  

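I am using the split function below, but I can't get it to split the parts of the string that are not in parentheses. What is the best way to handle this regex problem?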

splitme <- function(x) {
  x <- unlist(strsplit(x, "(?=\\()", perl=TRUE))
  x <- unlist(strsplit(x, "(?<=\\))", perl=TRUE))
  for (i in which(x=="(")) {
    x[i+1] <- paste(x[i], x[i+1], sep="")
  }
  x[-which(x=="(")]
}

splitme(x)
 [1] "(A|C|T)"   "AG"        "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "GCC"      
[10] "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"  

3 answers:

Answer 0: (score: 3)

Something like this should work:

> library(stringi)

> unlist(stri_extract_all_regex(x, "\\([ACGT\\|]*\\)|[ACGT]"))
 [1] "(A|C|T)"   "A"         "G"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)"
 [7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G"         "C"         "C"        
[13] "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"  

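\\([ACGT\\|]*\\) matches each parenthesised group in full, and [ACGT] matches the remaining bases one at a time. The order of the alternation matters: the group branch is tried first, so a letter inside parentheses is never extracted on its own.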
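For readers who would rather not add a package dependency, the same extraction idea can be sketched in base R with gregexpr() and regmatches(). This is a minimal sketch, not part of the original answer; the variable names pattern and tokens are chosen here for illustration.

```r
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"

# Same alternation as above: a parenthesised group first, otherwise a single base.
pattern <- "\\([ACGT|]*\\)|[ACGT]"

# gregexpr() locates every match; regmatches() extracts the matched substrings.
tokens <- regmatches(x, gregexpr(pattern, x))[[1]]

# The tokens cover the whole input, so gluing them back together gives x again.
identical(paste(tokens, collapse = ""), x)
```

Extracting tokens rather than splitting avoids the special handling that strsplit() applies to zero-length matches.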

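Answer 1: (score: 2)

It looks like you want to split the string after every ) and after every letter that is followed by another letter or a (. If that is the behavior you want, you can use: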

pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"
strsplit(x, pat, perl=TRUE)[[1]]
#  [1] "(A|C|T)"   "A"         "G"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)"
#  [7] "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "G"         "C"         "C"        
# [13] "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)" 
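Because both branches of that pattern are zero-width (lookbehind and lookahead only), the split should consume no characters. A small sanity check along these lines can confirm it; this is a sketch added for illustration, and the name pieces is not from the original answer.

```r
x <- "(A|C|T)AG(C|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)(A|C|G|T)GCC(C|T)(A|C|G|T)(A|C|G|T)(A|C|G)"
pat <- "(?<=\\))|(?<=[[:alpha:]])(?=[[:alpha:]\\(])"
pieces <- strsplit(x, pat, perl = TRUE)[[1]]

# Nothing is consumed by the split, so rejoining the pieces must reproduce x.
stopifnot(identical(paste(pieces, collapse = ""), x))

# Every piece should be either a single letter or one parenthesised group.
stopifnot(all(grepl("^([[:alpha:]]|\\([^()]*\\))$", pieces)))
```

If either stopifnot() call fails on other input, the string probably contains something the pattern does not anticipate, such as nested parentheses or non-letter characters outside a group.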

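Answer 2: (score: 1)

To split a string into single letters, you can simply run strsplit(x, ""). All you have to do is make sure you don't apply it to the "finished" strings (i.e. the ones in parentheses).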

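y = splitme(x)
# Flag the pieces that contain no parenthesis; only those still need splitting.
Indices = !grepl("\\(", y)
# Assigning a list coerces y to a list; unlist() flattens it back to a vector.
y[Indices] = strsplit(y[Indices], "")
unlist(y)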
 [1] "(A|C|T)"   "A"         "G"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)" "(A|C|G|T)"
 [9] "(A|C|G|T)" "G"         "C"         "C"         "(C|T)"     "(A|C|G|T)" "(A|C|G|T)" "(A|C|G)"