Finding genomic combinations without using any package

Time: 2019-01-21 13:05:07

Tags: r seq

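I want to find how many genomic combinations occur in a sequence. By that I mean 2-element combinations (AA, AT, AG, AC, ... 16 combinations in total) or 3-element combinations (ATG, ACG, ... 64 combinations in total). I know how to do this with a package, and I show that code here. I want to write my own code that does the same thing.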

The seqinr package does this job well. This is the code I used:

install.packages('seqinr')
library(seqinr)
m = read.fasta(file='sequence.fasta')
mseq = m[[1]]
count(mseq,2)   # table of counts for each 2-element combination in the seq
count(mseq,3)   # table of counts for each 3-element combination in the seq

2 answers:

Answer 0: (score: 2)

This is a slow way to do it. I'm sure a Bioconductor package does it faster.

# some practice data
mseq = paste(sample(c("A", "C", "G", "T"), 1000, replace = TRUE), collapse = "")

# define a function called count
# (note: this masks seqinr::count if seqinr is attached)
count = function(mseq, n){
  # split the sequence into every overlapping sub sequence of length n
  x = sapply(seq_len(nchar(mseq) - n + 1), function(i) substr(mseq, i, i+n-1))
  # how many unique sub sequences of length n are there?
  length(table(x))
}
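The function above returns only the number of distinct sub sequences. If a per-combination table over all 16 or 64 possible words is wanted, including words that never occur, something like the following base R sketch could work. The name `kmer_table` and its `alphabet` argument are made up for illustration and are not part of seqinr.

```r
# Hypothetical helper (not from seqinr): count every overlapping k-mer in a
# single string, reporting zero for words that do not occur.
kmer_table <- function(s, k, alphabet = c("A", "C", "G", "T")) {
  n <- nchar(s)
  if (k < 1 || k > n) stop("k must be between 1 and nchar(s)")
  # Start position of every overlapping window of width k
  starts <- seq_len(n - k + 1)
  kmers <- substring(s, starts, starts + k - 1)
  # Build all length(alphabet)^k possible words to use as factor levels
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_words <- sort(do.call(paste0, grid))
  table(factor(kmers, levels = all_words))
}

tab <- kmer_table("ATGATG", 2)
tab            # one entry per possible 2-mer
sum(tab > 0)   # number of distinct 2-mers actually present
```

Using a factor with fixed levels is what keeps absent words in the table with a count of zero.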

Actually, I just checked, and this is pretty much how they do it:

function (seq, wordsize, start = 0, by = 1, freq = FALSE, alphabet = s2c("acgt"), 
    frame = start) 
{
    if (!missing(frame)) 
        start = frame
    istarts <- seq(from = 1 + start, to = length(seq), by = by)
    oligos <- seq[istarts]
    oligos.levels <- levels(as.factor(words(wordsize, alphabet = alphabet)))
    if (wordsize >= 2) {
        for (i in 2:wordsize) {
            oligos <- paste(oligos, seq[istarts + i - 1], sep = "")
        }
    }
    counts <- table(factor(oligos, levels = oligos.levels))
    if (freq == TRUE) 
        counts <- counts/sum(counts)
    return(counts)
}
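The core of the listing above is the loop that pastes shifted copies of a character vector together. A stripped-down sketch of just that step, in base R, might look like this. The function name `shifted_kmers` is invented for illustration, and it assumes the input is a vector with one character per element, which is the form `read.fasta` returns.

```r
# Illustrative sketch of the "paste shifted copies" idea, not seqinr's code.
# `chars` is assumed to be a character vector with one letter per element.
shifted_kmers <- function(chars, k) {
  stopifnot(k >= 1, k <= length(chars))
  # Only start positions that leave room for a full window of width k
  starts <- seq_len(length(chars) - k + 1)
  oligos <- chars[starts]
  # Append the letter 1, 2, ..., k - 1 positions to the right of each start
  for (i in seq_len(k - 1)) {
    oligos <- paste0(oligos, chars[starts + i])
  }
  oligos
}

chars <- strsplit("atgatg", "")[[1]]
table(shifted_kmers(chars, 3))
```

Because every step works on whole vectors, this avoids calling `substr` once per position.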

If you want to look up the source code of a function, use getAnywhere():

getAnywhere(count)

Answer 1: (score: 0)

A simple approach would be something like this:

# Generate a test sequence
set.seed(1234)
testSeq <- paste(sample(LETTERS[1:3], 100, replace = TRUE), collapse = "")

# Split string into non-overlapping chunks of size 2 and then count occurrences
# (starts stop early enough that every chunk is a full 2 characters)
starts <- seq(1, nchar(testSeq) - 1, by = 2)
testBigram <- substring(testSeq, starts, starts + 1)
table(testBigram)

AA AB AC BA BB BC CA CB CC 
10 10 14  3  3  2  2  5  1 
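Note that splitting into chunks counts non-overlapping pairs (positions 1-2, 3-4, ...), whereas the seqinr `count` function in the question slides a window one position at a time by default (`by = 1`). A small sketch on a toy string, with variable names chosen only for illustration, shows how the two differ:

```r
s <- "AABAAB"

# Non-overlapping chunks of width 2: positions 1-2, 3-4, 5-6
chunk_starts <- seq(1, nchar(s) - 1, by = 2)
chunks <- substring(s, chunk_starts, chunk_starts + 1)

# Overlapping (sliding) windows of width 2: positions 1-2, 2-3, ..., 5-6
window_starts <- seq_len(nchar(s) - 1)
windows <- substring(s, window_starts, window_starts + 1)

table(chunks)
table(windows)
```

Which of the two is appropriate depends on whether the combinations are meant to be read in a fixed frame or at every position.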

Here is one way to do it using a "function factory" (https://adv-r.hadley.nz/function-factories.html).

2-element and 3-element combinations are n-grams of size 2 and 3, so we build a function factory for n-grams.

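# Function factory: returns a function that splits a string into
# non-overlapping chunks of the given size
ngram <- function(size) {
  function(myvector) {
    # Only full-length chunks; a trailing partial chunk is dropped
    starts <- seq(1, nchar(myvector) - size + 1, by = size)
    substring(myvector, starts, starts + size - 1)
  }
}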

# Assign the functions names (optional)
bigram <- ngram(2)
trigram <- ngram(3)

# 2 element combinations
table(bigram(testSeq))

AA AB AC BA BB BC CA CB CC 
10 10 14  3  3  2  2  5  1 

# count of 2 element combinations
length(unique(bigram(testSeq)))

[1] 9

# counting function
count <- function(mseq, n) length(unique(ngram(n)(mseq)))
count(testSeq, 2)

[1] 9

# and if we wanted to do this with 3 element combinations
table(trigram(testSeq))
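
When the string length is not a multiple of the chunk size (100 characters and size 3, for example), the start and end positions need to line up so that no partial or empty chunk is produced. One possible variation of the factory, sketched here with the invented name `ngram_factory` and an assumed `step` argument, drops the trailing partial chunk and also allows sliding windows:

```r
# Hypothetical variant of the factory: `step = size` gives non-overlapping
# chunks, `step = 1` gives sliding windows. Incomplete trailing chunks are dropped.
ngram_factory <- function(size, step = size) {
  stopifnot(size >= 1, step >= 1)
  function(s) {
    n <- nchar(s)
    if (n < size) return(character(0))
    # Last start position that still leaves room for a full chunk
    starts <- seq(1, n - size + 1, by = step)
    substring(s, starts, starts + size - 1)
  }
}

chunk3 <- ngram_factory(3)
slide3 <- ngram_factory(3, step = 1)

chunk3("ABCABCA")   # two full chunks; the trailing "A" is dropped
slide3("ABCABCA")   # every window of width 3
```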