library(Biostrings)
myseq = DNAStringSet(c("ATGACGAACTGTAAAGGACTGCACGGCC",
                       "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
                       "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT"))
The patterns I need to search for are
fragments= DNAStringSet(c("ACTG","AAAA"))
counts=vcountPDict(fragments,myseq)
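I would like to compile this information into a table, with the DNA sequence, the fragment, and the count in separate columns, so that it can be presented clearly.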
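Answer 0 (score: 0)
I just asked a similar question: "How to take a word and create an indicator variable based on the word's presence in comments?". Applied here, grepl() flags whether each fragment is present in each sequence (1) or not (0):
fragment <- c("ACTG","AAAA")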
sequence <- c("ATGACGAACTGTAAAGGACTGCACGGCC",
"TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
"GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT")
sequence <- data.frame(sequence, sapply(fragment, function(i) as.numeric(grepl(i, sequence))))
sequence
                                     sequence ACTG AAAA
1                ATGACGAACTGTAAAGGACTGCACGGCC    1    0
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    1    1
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    1    0
From here you can compute, for each fragment, how many sequences contain it and what proportion do:
seq.counts <- colSums(sequence[,2:3])
seq.means <- colMeans(sequence[,2:3])
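Alternatively, to count occurrences rather than just flag presence, apply str_count() from the stringr package to the original sequence vector:
library(stringr)
sequence <- data.frame(sequence, sapply(fragment, function(i) str_count(sequence, i)))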
The result is
> sequence
                                     sequence ACTG AAAA
1                ATGACGAACTGTAAAGGACTGCACGGCC    2    0
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    1    2
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    1    0
You can use colSums or colMeans on these count columns as well.
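If you would rather not depend on stringr, here is a rough base R sketch of the same counting. The helper name count_matches is made up for this example and does not come from any package. It relies on gregexpr() with a Perl lookahead, so unlike str_count() it also counts overlapping matches; for the sequences in this question that makes no difference.

```r
# Hypothetical helper (not from any package): count occurrences of
# `pattern` in each element of `x`, including overlapping ones.
count_matches <- function(pattern, x) {
  # A zero-width lookahead lets matches overlap,
  # e.g. "AAAAA" contains "AAAA" twice.
  hits <- gregexpr(paste0("(?=", pattern, ")"), x, perl = TRUE)
  # gregexpr() returns -1 when nothing matches, so count positive positions only.
  vapply(hits, function(h) sum(h > 0), integer(1))
}

fragment <- c("ACTG", "AAAA")
sequence <- c("ATGACGAACTGTAAAGGACTGCACGGCC",
              "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
              "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT")

# One column per fragment, one row per sequence.
counts_wide <- data.frame(sequence,
                          sapply(fragment, count_matches, x = sequence))
counts_wide
```

The result is kept in a new object (counts_wide) so the original sequence vector is not overwritten and can be reused.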
Answer 1 (score: 0)
library(Biostrings)
myseq = DNAStringSet(c(
"ATGACGAACTGTAAAGGACTGCACGGCC",
"TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
"GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT"
))
fragments = DNAStringSet(c("ACTG","AAAA"))
counts = vcountPDict(fragments, myseq)
# counts has one row per pattern and one column per sequence.
# expand.grid() varies its first argument fastest, so seqID cycles within each pattern
a = expand.grid(seqID = as.character(myseq), pattern = as.character(fragments))
# Transpose counts before flattening so the frequencies follow the same order as the rows of a
b = data.frame(freq = as.vector(t(counts)))
r = cbind.data.frame(a, b)
The result is
> r
                                        seqID pattern freq
1                ATGACGAACTGTAAAGGACTGCACGGCC    ACTG    2
2    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    ACTG    1
3 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    ACTG    1
4                ATGACGAACTGTAAAGGACTGCACGGCC    AAAA    0
5    TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG    AAAA    2
6 GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT    AAAA    0
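One way to avoid lining the counts up with the labels by hand is to name the dimensions of the count matrix and let as.data.frame(as.table()) build the long table. This is only a sketch under assumptions: the matrix is computed with base R as a stand-in for the vcountPDict result (rows are patterns, columns are sequences) so that the snippet runs without Biostrings, and count_one is a made-up helper name. With the real counts object you would assign its dimnames from the fragments and sequences instead.

```r
sequences <- c("ATGACGAACTGTAAAGGACTGCACGGCC",
               "TCCAACGAGAAAACCTGTGGGCACGGCCAAAACTGTTGGG",
               "GGCGGGGACAAATGTTCCATGACTGGCCTTTAAAGGCCTAGAT")
patterns <- c("ACTG", "AAAA")

# Hypothetical helper: occurrences of pattern p in one string s
# (the lookahead counts overlapping occurrences as well).
count_one <- function(p, s) {
  sum(gregexpr(paste0("(?=", p, ")"), s, perl = TRUE)[[1]] > 0)
}

# Stand-in for the count matrix: rows are patterns, columns are sequences.
counts <- sapply(sequences, function(s) {
  vapply(patterns, count_one, integer(1), s = s)
})

# Name the two dimensions so every count carries its own labels.
names(dimnames(counts)) <- c("pattern", "seqID")

# One row per (pattern, seqID) pair, with the count in a "freq" column.
long <- as.data.frame(as.table(counts), responseName = "freq",
                      stringsAsFactors = FALSE)
long <- long[, c("seqID", "pattern", "freq")]
long
```

Because each count is looked up by its row and column name, the pairing of sequence, pattern, and frequency does not depend on the order in which the matrix is flattened.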