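Parsing a table in R to count nucleotide frequencies

Time: 2014-02-27 16:36:21

Tags: r

I'm new to R. I believe I have a table of strings, which I extracted from a text file containing a list of nucleotides (e.g. "AGCTGTCATGCT.....").

Here are the first two lines of the text file as an example: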

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC

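I need to count every "A" in the sequence by incrementing a variable a. The same goes for G, C and T (the variables to increment being g, c and t respectively).

At the end of the for loop, I want the number of times the "A", "G", "C" and "T" nucleotides occur, so that I can then compute the dinucleotide frequencies and a transition matrix. My code is below. It doesn't work: it just returns every variable equal to 0, which is wrong. Please help, thanks!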

#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0

for(i in dnaseq[[1]]){
    if(i=="A") (inc(a)<-1)
    if(i=="G") (inc(g)<-1)
    if(i=="C") (inc(c)<-1)
    if(i=="T") (inc(t)<-1)
}
a
g
c
t

3 answers:

Answer 0: (score: 2)

The simplest way to get a count of each nucleotide (or of any kind of letter) is to use the table and strsplit functions. For example:

myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
#  [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"

# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A  C  G  T 
# 20 12 17 21 

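Now, if you don't care about the distinction between one line and the next (that is, if ecoli.txt is just one long sequence), you will want to combine the file into a single long string first: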

table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])

That is a one-line solution, but it may be clearer when split into three lines:

combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")[[1]]
frequencies = table(combined.seq.vector)
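To go from raw counts to relative frequencies, the same pipeline can be followed by prop.table. The sketch below is only an illustration: the vector `lines` is a hypothetical stand-in for the lines read from ecoli.txt (here the first 32 bases of the example sequence, split in two), and the fixed factor levels are an assumption that ensures all four nucleotides appear in the result even if one is absent.

```r
# Hypothetical stand-in for the lines of the file (not read from disk here)
lines <- c("AGCTTTTCATTCTGAC", "TGCAACGGGCAATATG")

# Merge the lines, split into single characters, and count each nucleotide
combined <- paste(lines, collapse = "")
chars <- strsplit(combined, "")[[1]]
counts <- table(factor(chars, levels = c("A", "C", "G", "T")))

# Relative frequencies: each count divided by the total number of bases
freqs <- prop.table(counts)

counts
freqs
```

Using a factor with explicit levels keeps the output shape stable, which helps when comparing several sequences.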

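If you're wondering what went wrong with your original code: first, I don't know where the inc function comes from (or why it isn't throwing an error: are you sure dnaseq[[1]] has length greater than 0?). In any case, you aren't iterating over the sequence, you're iterating over the lines. The loop variable i will never be a single character like "A" or "T"; it will always be an entire line.

In any case, the solution using table, strsplit and paste with collapse is both more concise and more computationally efficient than a for loop (or rather the pair of nested for loops, which is what you would actually need).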
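For comparison, here is a rough sketch of what the nested-loop version would have to look like: an outer loop over lines and an inner loop over the characters of each line. The vector `lines` and the named vector `counts` are hypothetical names used for illustration only, and plain addition replaces the inc function, whose origin is unknown.

```r
# Hypothetical stand-in for the lines of the sequence file
lines <- c("AGCT", "TTCA")

# One named counter per nucleotide
counts <- c(A = 0L, C = 0L, G = 0L, T = 0L)

for (line in lines) {                      # outer loop: one line at a time
  for (ch in strsplit(line, "")[[1]]) {    # inner loop: one character at a time
    if (ch %in% names(counts)) {
      counts[ch] <- counts[ch] + 1L        # increment the matching counter
    }
  }
}

counts
```

This gives the same counts as table would, but with more code and slower execution on long sequences.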

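Answer 1: (score: 1)

You can call the str_count function from the stringr package (it counts the occurrences of a fixed text pattern) using the code below. It should be faster than the other solutions, which split the string into single-letter substrings.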

require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt") 
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
   str_count(dnaseq, fixed(nuc)))

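Note that this solution can easily be extended to searching for subsequences of length > 1: just change the search patterns in sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides.

Be aware, however, that looking for AGA in AGAGA will report only 1 occurrence, because str_count() does not count overlapping matches.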
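If overlapping occurrences do need to be counted, one possible workaround is a zero-width lookahead, which matches at every position where the pattern starts. The sketch below uses base R's gregexpr with perl = TRUE rather than stringr, and the helper name count_overlapping is made up for this example. It assumes the pattern contains no regex metacharacters, which holds for plain nucleotide strings.

```r
# Count occurrences of a pattern, including overlapping ones (base R sketch).
# Assumes `pattern` has no regex metacharacters, e.g. a plain string like "AGA".
count_overlapping <- function(pattern, text) {
  # the lookahead (?=...) consumes no characters, so matches may overlap
  hits <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match at all
}

count_overlapping("AGA", "AGAGA")  # 2: matches start at positions 1 and 3
count_overlapping("AA", "AAAA")    # 3: matches start at positions 1, 2 and 3
count_overlapping("GG", "ACT")     # 0: no match
```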

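Answer 2: (score: 0)

I'm assuming your nucleotide sequence is in a character vector of length 1. If you're after dinucleotide frequencies and a transition matrix, here is one solution: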

dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
           CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"

## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
#    Var1 Var2 two
# 1     A    A  AA
# 2     T    A  TA
# 3     G    A  GA
# 4     C    A  CA
# 5     A    T  AT
# 6     T    T  TT
# 7     G    T  GT
# 8     C    T  CT
# 9     A    G  AG
# 10    T    G  TG
# 11    G    G  GG
# 12    C    G  CG
# 13    A    C  AC
# 14    T    C  TC
# 15    G    C  GC
# 16    C    C  CC

## Using `vapply` and fixed-string matching to count dinucleotide occurrences
## (gregexpr skips overlapping matches; sum(... > 0) yields 0, not 1, when nothing matches):
nuc_comb$freq <- vapply(nuc_comb$two, 
  function(x) sum(gregexpr(x, dnaseq, fixed = TRUE)[[1]] > 0), 
  integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC 
# 11 11  7  5  9 12  9 13  7 13  4  2  8  7  5  2

## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide", 
  idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
#    A  T  G C
# A 11  9  7 8
# T 11 12 13 7
# G  7  9  4 5
# C  5 13  2 2

## get row proportions for the transition matrix
## (probability of moving from the nucleotide in the row to the nucleotide in the column)
dinuc_tab <- prop.table(dinuc_mat, 1)
#           A         T          G          C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
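An alternative way to build the same kind of matrix is to pair every base with its successor and cross-tabulate the pairs. Unlike non-overlapping pattern matching, this counts every adjacent pair, so runs such as AAA contribute two AA pairs. The sketch below is an illustration under assumptions, not a drop-in replacement: it uses only the first 16 bases of the example sequence as toy input, and the names clean, from, to, dinuc_counts and transition are invented here. The gsub step is included because a string literal that spans two lines carries a line break and indentation inside it.

```r
# Toy input: the first 16 bases of the example sequence
dnaseq <- "AGCTTTTCATTCTGAC"
nuc <- c("A", "T", "G", "C")

# Drop anything that is not a nucleotide letter (line breaks, spaces, ...)
clean <- gsub("[^ACGT]", "", dnaseq)
chars <- strsplit(clean, "")[[1]]

# Pair each base (all but the last) with its successor (all but the first)
from <- factor(head(chars, -1), levels = nuc)
to   <- factor(tail(chars, -1), levels = nuc)

# Rows: current nucleotide; columns: next nucleotide
dinuc_counts <- table(from, to)

# Row proportions give the estimated transition probabilities
transition <- prop.table(dinuc_counts, 1)

dinuc_counts
transition
```

A nucleotide that never appears in the from position would produce a row of NaN in the transition matrix, so short inputs may need special handling.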