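I'm new to R. I believe I have a table of strings that I read in from a text file containing a list of nucleotides (e.g. "AGCTGTCATGCT.....").
Here are the first two lines of the text file as an example: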
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
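I need to count each "A" in the sequence by incrementing a variable a. The same goes for G, C and T (the variables to increment being g, c and t respectively).
At the end of the "for" loop I'd like to have the number of times the "A", "G", "C" and "T" nucleotides occur, so that I can then work out the dinucleotide frequencies and a transition matrix. My code is below. It doesn't work: it just returns every variable equal to 0, which is wrong. Please help, thanks!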
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t
Answer 0 (score: 2)
The simplest way to get counts of each nucleotide (or of any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
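Now, if you don't care about the distinction between one line and the next (that is, if ecoli.txt is just one long sequence), you'll want to combine the file into one long string first: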
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one-line solution, but it may be clearer broken into three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")[[1]]  # [[1]] extracts the character vector from the list
frequencies = table(combined.seq.vector)
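Since the question asks for separate a, g, c and t counters, one way to get them is to index the table by name. This is a minimal sketch that uses a short made-up sequence in place of the file contents; toy.seq, chars and counts are illustrative names, not part of the original code.

```r
# toy.seq stands in for the combined sequence read from the file (assumption)
toy.seq = "AGCTTTTCATTC"
chars = strsplit(toy.seq, "")[[1]]

# fixing the factor levels keeps a zero entry for any base that never occurs
counts = table(factor(chars, levels = c("A", "C", "G", "T")))

# pull out the individual counts by name
a = counts[["A"]]
g = counts[["G"]]
c = counts[["C"]]
t = counts[["T"]]
```

Note that variables named c and t share their names with base R functions; calls such as c(1, 2) still work, but different names may be less confusing.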
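If you're wondering what went wrong with your original code in the first place: I don't know where the inc function comes from (or why it isn't throwing an error: are you sure dnaseq[[1]] has length greater than 0?), but in any case you weren't iterating over the sequence, you were iterating over the lines. i will never be a single character like A or T; it will always be a whole line.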
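In any case, the solution with collapse, table and strsplit is more concise and more computationally efficient than a for loop (or rather a pair of nested for loops, which is what you would need).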
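For comparison, here is roughly what the pair of nested for loops would look like. It is only a sketch: lines.of.seq is a made-up stand-in for dnaseq[[1]] (one string per line of the file), and the counts live in a named vector rather than in four separate variables.

```r
# lines.of.seq stands in for dnaseq[[1]] (assumption: one string per line)
lines.of.seq = c("AGCT", "TTGA")

# one named counter per nucleotide
counts = c(A = 0, G = 0, C = 0, T = 0)

for (line in lines.of.seq) {                # outer loop: lines of the file
  for (ch in strsplit(line, "")[[1]]) {     # inner loop: single characters
    if (ch %in% names(counts)) {            # skip anything that is not A/G/C/T
      counts[ch] = counts[ch] + 1
    }
  }
}
counts
```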
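Answer 1 (score: 1)
You can use the str_count function from the stringr package (it counts the occurrences of a fixed text pattern), as in the code below. It should be faster than the other solutions, which split the string into one-letter substrings.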
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
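Note that this solution extends easily to searching for subsequences of length > 1: just change the search patterns passed to sapply(), for example to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all nucleotide pairs.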
Be aware, though, that searching for AGA in AGAGA will report only 1 occurrence, because str_count() does not consider overlapping matches.
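If overlapping matches matter, one workaround is a zero-width lookahead, which lets the regex engine report a match at every starting position. Below is a minimal base R sketch of that idea; count_overlapping is a hypothetical helper, and it assumes the pattern consists of plain letters only, so nothing needs escaping.

```r
# count_overlapping is a hypothetical helper, not part of stringr or base R
count_overlapping <- function(pattern, text) {
  # wrap the pattern in a lookahead so matches consume no characters
  hits <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  # gregexpr() returns -1 when there is no match, so count positive positions only
  sum(hits > 0)
}

count_overlapping("AGA", "AGAGA")
```

The same lookahead pattern can presumably be passed to str_count() as a regular expression instead of fixed(), but that variant is not shown here.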
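Answer 2 (score: 0)
I'm assuming your nucleotide sequence is in a character vector of length 1. If you're after dinucleotide frequencies and a transition matrix, here is one solution: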
dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
## Using `vapply` and gregexpr() with fixed = TRUE (literal, non-overlapping matches) to count dinucleotide sequences:
nuc_comb$freq <- vapply(nuc_comb$two,
function(x) length(gregexpr(x, dnaseq, fixed = TRUE)[[1]]),
integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get row-wise proportions for the transition matrix
## (probability of moving from the nucleotide in the row to the nucleotide in the column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
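One caveat with the gregexpr() approach: it counts non-overlapping matches (so "AAA" contributes one "AA", not two), it reports a length of 1 even when a pair never occurs, and a line break inside the string hides the pair that spans it. An alternative sketch that counts every adjacent pair is to tabulate each character against its successor. The toy sequence and the names bases, chars, dinuc_counts and transition below are illustrative, and on the real sequence the numbers may differ from the ones shown above.

```r
# toy sequence standing in for dnaseq (assumption)
dna <- "AGAGAAAT"
bases <- c("A", "T", "G", "C")

# drop any whitespace (e.g. line breaks), then split into single characters
chars <- strsplit(gsub("[[:space:]]", "", dna), "")[[1]]

# pair each nucleotide with the one that follows it
from <- factor(head(chars, -1), levels = bases)
to   <- factor(tail(chars, -1), levels = bases)

# 4 x 4 table of overlapping dinucleotide counts (rows: first base, columns: second base)
dinuc_counts <- table(from = from, to = to)

# row-wise proportions; rows with no observations come out as NaN
transition <- prop.table(dinuc_counts, 1)
```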