Searching for multiple patterns in multiple DNA sequences with the Biostrings package

Asked: 2017-04-27 12:05:18

Tags: r string

For example, I have 3 sequences:

library(Biostrings)
myseq=DNAStringSet(c("ATGACGAACTGTAAAGGACTGCACGGCC",
                     "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
                     "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT"))

The patterns I need to search for are:

fragments= DNAStringSet(c("ACTG","AAAA"))

counts=vcountPDict(fragments,myseq)

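I would like to compile this information into a table with separate columns for the DNA sequence, the fragment, and the count, so that it can be presented neatly.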

2 answers:

Answer 0 (score: 0)

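I just asked a similar question in "How to take a word and create an indicator variable based on the word's presence in comments?". The approach from there uses grepl to flag whether each fragment is present in each sequence:
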
fragment <- c("ACTG","AAAA") 
sequence <- c("ATGACGAACTGTAAAGGACTGCACGGCC",
              "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
              "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT")

sequence <- data.frame(sequence, sapply(fragment, function(i) as.numeric(grepl(i, sequence))))     
sequence
                                      sequence ACTG AAAA
 1                ATGACGAACTGTAAAGGACTGCACGGCC    1    0
 2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    1    1
 3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    1    0

From here you can compute, for each fragment, the number and the proportion of sequences that contain it:

seq.counts <- colSums(sequence[,2:3])
seq.means  <- colMeans(sequence[,2:3])

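Alternatively, starting again from the original character vector sequence (not the data frame built above), you can count the occurrences of each fragment with str_count from the stringr package:

sequence <- data.frame(sequence, sapply(fragment, function(i) stringr::str_count(sequence, i)))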

The result is

> sequence
                                     sequence ACTG AAAA
1                ATGACGAACTGTAAAGGACTGCACGGCC    2    0
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    1    2
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    1    0

You can use colSums and colMeans on these counts as well.
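Since the question asks for separate sequence, fragment, and count columns, the wide table can also be stacked into long format. The following is only a minimal sketch in base R; the helper count_fixed and the names wide and long are made up for illustration, and gregexpr counts non-overlapping matches, which may differ from a count that allows overlaps.

```r
fragment <- c("ACTG", "AAAA")
sequence <- c("ATGACGAACTGTAAAGGACTGCACGGCC",
              "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
              "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT")

# Hypothetical helper: number of non-overlapping literal matches of
# `pattern` in each element of `x` (gregexpr returns -1 when there is none)
count_fixed <- function(pattern, x) {
  hits <- gregexpr(pattern, x, fixed = TRUE)
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# Wide matrix: one row per sequence, one column per fragment
wide <- sapply(fragment, count_fixed, x = sequence)

# Long table: as.vector() flattens column by column, so all rows for the
# first fragment come first, then all rows for the second fragment
long <- data.frame(
  sequence = rep(sequence, times = length(fragment)),
  fragment = rep(fragment, each = length(sequence)),
  count    = as.vector(wide)
)
long
```

Each row of long then describes one sequence/fragment pair, which is usually easier to filter or plot than the wide layout.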

Answer 1 (score: 0)

library(Biostrings)
myseq = DNAStringSet(c(
    "ATGACGAACTGTAAAGGACTGCACGGCC",
    "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
    "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT"
))

fragments = DNAStringSet(c("ACTG","AAAA"))
counts = vcountPDict(fragments, myseq)

# Build every (sequence, pattern) pair; seqID varies fastest
a = expand.grid(seqID = as.character(myseq), pattern = as.character(fragments))

# counts has one row per pattern and one column per sequence, so transpose it
# before flattening; otherwise freq does not line up with the rows of a
b = data.frame(freq = as.vector(t(counts)))
r = cbind.data.frame(a, b)

The result is

> r
                                        seqID pattern freq
1                ATGACGAACTGTAAAGGACTGCACGGCC    ACTG    2
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    ACTG    1
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    ACTG    1
4                ATGACGAACTGTAAAGGACTGCACGGCC    AAAA    0
5    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    AAAA    2
6 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    AAAA    0
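When pairing a count matrix with expand.grid, it is worth checking that the freq column really lines up with the seqID and pattern columns. The sketch below does not call Biostrings; it assumes that vcountPDict returns one row per pattern and one column per sequence, and it types in by hand the counts that str_count reports in the other answer. The labels seq1 to seq3 and the freq_* column names are made up for illustration.

```r
# Hand-built stand-in for the vcountPDict result (assumed layout:
# rows = patterns, columns = sequences)
counts <- matrix(c(2L, 0L, 1L, 2L, 1L, 0L), nrow = 2,
                 dimnames = list(c("ACTG", "AAAA"),
                                 c("seq1", "seq2", "seq3")))

# Same pairing as in the answer: the first factor (seqID) varies fastest
grid <- expand.grid(seqID = colnames(counts), pattern = rownames(counts),
                    stringsAsFactors = FALSE)

# Flattening the matrix directly goes column by column (sequence by
# sequence), which does not match the row order of grid
grid$freq_flat <- as.vector(counts)

# Transposing first gives pattern-by-pattern order, which does match
grid$freq_t <- as.vector(t(counts))

# Explicit lookup by (pattern, sequence) name, as a reference
grid$freq_lookup <- counts[cbind(grid$pattern, grid$seqID)]

grid
```

If freq_t and freq_lookup agree while freq_flat does not, the matrix needs to be transposed before it is flattened.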