Subset a DNAStringSet by a subpattern and remove the subpattern in R

Date: 2017-02-18 05:40:59

Tags: r regex bioinformatics fasta bioconductor

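I want to keep only the rows that contain a given substring and then remove that substring from them. I can do the first part, but I don't know how to remove the substring.

Here is an example: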

library(Biostrings)
myseq <-DNAStringSet(c("CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG"))
names(myseq) <- letters[1:3]
myseq

A DNAStringSet instance of length 3
width seq                                                                                                           names               
[1]    40 CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA                                                                    a
[2]    17 CCCATGAACATAGATCC                                                                                           b
[3]    17 CCCGTACAGATCACGTG                                                                                           c

The sequence I want to remove is AGATCGGAAGAGCACACGTCTGAA, which occurs in the first row.

matchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq[[1]])

Views on a 40-letter DNAString subject
subject: CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA
views:
    start end width
[1]     9  32    24 [AGATCGGAAGAGCACACGTCTGAA]

To subset, I do the following:

pat <- vmatchPattern("AGATCGGAAGAGCACACGTCTGAA", myseq)
myseq[ lapply(lapply(pat, isEmpty), function(x) x == FALSE) ]

A DNAStringSet instance of length 3
    width seq                                                                                                         names               
[1]    40 CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA                                                                    a
[2]     0                                                                                                             b
[3]     0                                                                                                             c

The output should be:

A DNAStringSet instance of length 3
    width seq                                                                                                         names               
[1]    11 CCCCCCATGAA                                                                                                 a
[2]     0                                                                                                             b
[3]     0                                                                                                             c

2 answers:

Answer 0 (score: 1)

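You can use vcountPattern to count the matches inside an ifelse statement, replacing the matching entries with the output of stringr's str_replace and the non-matching entries with an empty string: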

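library(stringr)  # provides str_replace

# ifelse is applied to the whole logical vector at once, so each sequence
# is paired with its own str_replace result
myseq2 <- DNAStringSet(
            ifelse(
              vcountPattern(
                'AGATCGGAAGAGCACACGTCTGAA', myseq) > 0,
              str_replace(
                as.character(myseq),
                'AGATCGGAAGAGCACACGTCTGAA',
                ''),
              '')
            )
names(myseq2) <- names(myseq)
myseq2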

A DNAStringSet instance of length 3
    width seq                                                     names               
[1]    16 CCCATGAACCCATGAA                                        a
[2]     0                                                         b
[3]     0                                                         c

Slightly more readable with pipe notation:

ifelse(vcountPattern('AGATCGGAAGAGCACACGTCTGAA', myseq) > 0,
       str_replace(as.character(myseq), 'AGATCGGAAGAGCACACGTCTGAA', ''),
       '') %>%
    DNAStringSet() -> myseq2
names(myseq2) <- names(myseq)
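
If the non-matching sequences should be dropped rather than blanked, the same idea can be sketched without stringr. This is a minimal sketch, not the answerer's code: it assumes base R's sub with fixed = TRUE is an acceptable stand-in for str_replace, and the names adapter, has_adapter, kept and trimmed are made up for illustration.

```r
library(Biostrings)

myseq <- DNAStringSet(c(a = "CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA",
                        b = "CCCATGAACATAGATCC",
                        c = "CCCGTACAGATCACGTG"))
adapter <- "AGATCGGAAGAGCACACGTCTGAA"  # illustrative name for the pattern

# TRUE for every sequence that contains the pattern at least once
has_adapter <- vcountPattern(adapter, myseq) > 0

# keep only the matching sequences, then delete the first occurrence
# of the pattern in each; sub() keeps the names of its input
kept <- myseq[has_adapter]
trimmed <- DNAStringSet(sub(adapter, "", as.character(kept), fixed = TRUE))
trimmed
```

Like str_replace, sub only removes the first occurrence in each sequence; gsub would remove every occurrence.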

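Answer 1 (score: 0)

I'm not familiar with the bioinformatics packages, but if you can convert the data to a list (and I believe it should be possible to convert the list back into the format the package uses), you can use the following approach:

1) Remove the desired pattern with the stringr library
2) Compute the length of the resulting sequence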

# load biostrings package
library(Biostrings)

# create sample dataset
myseq <-DNAStringSet(c("CCCATGAAAGATCGGAAGAGCACACGTCTGAACCCATGAA", "CCCATGAACATAGATCC", "CCCGTACAGATCACGTG"))
names(myseq) <- letters[1:3]

# remove sequences with no match
data <- myseq[vcountPattern("AGATCGGAAGAGCACACGTCTGAA", myseq) > 0]

# load stringr library
library(stringr)

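# convert the DNAStringSet to a named list of character strings
test <- as.list(as.character(data))
# replace the matched sequence
test <- lapply(test, str_replace, "AGATCGGAAGAGCACACGTCTGAA", "")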
# put together the new sequence and its length
test <- mapply(c, lapply(test, nchar), test, SIMPLIFY = FALSE)
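
The conversion back to the package's format mentioned at the start of this answer could look roughly like the sketch below. It assumes the list is taken as it stands right after the str_replace step, before the lengths are prepended; the name cleaned and its contents are a hypothetical stand-in for that list.

```r
library(Biostrings)

# hypothetical stand-in for the list produced by the str_replace step
cleaned <- list(a = "CCCATGAACCCATGAA")

# unlist() yields a named character vector, which DNAStringSet() accepts
result <- DNAStringSet(unlist(cleaned))

# width() reports the sequence lengths, so nchar() is not needed separately
width(result)
```

Once the data is back in a DNAStringSet, the length is carried by the object itself, so the separate length bookkeeping from step 2 becomes optional.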