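I have 100 fasta files, and I want to plot overlapping histograms of their genetic distance matrices to see how much the bootstrap replicates of my DNA data overlap.
I have worked out how to get ape to read each file using the following: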
files <- list.files("/Volumes/ALEX_R-HD/", pattern="/Volumes/ALEX_R-HD/xii-27-D")
library("ape")
library("pegas")
library("plyr")
library("dostats")
filenames <- dir(path="/Volumes/ALEX_R-HD/xii-27_D_coccus", full.names=TRUE, pattern="xii-27")
listOfiles <- lapply(filenames, function(x) read.dna(x, format="fasta"))
I then generate a genetic distance matrix for each one with:
distOfiles <- lapply(listOfiles, function(y) dist.dna(y, model="TN93"))
When I print them at the R console, the genetic distance objects look like this:
[[1]]
M_51_1_new__ M_51_3_new__ M_51_4_new2__ M_51_5_new2__ M_51_6_new__ M_51_7_new__ M_51_8_new__ madera_1_new__ madera_2_new__ madera_3__ madera_4_new__ madera_5_new__
M_51_3_new__ 0.000000000
M_51_4_new2__ 0.000000000 0.000000000
M_51_5_new2__ 0.000000000 0.000000000 0.000000000
M_51_6_new__ 0.000000000 0.000000000 0.000000000 0.000000000
M_51_7_new__ 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
M_51_8_new__ 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
madera_1_new__ 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343
madera_2_new__ 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.000000000
madera_3__ 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.000000000
and goes on to.... [[100]]
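Where I run into trouble is plotting the histogram of each one so that every bootstrap replicate is drawn on top of the others in the same window. The script below just draws each replicate in a brand-new plot and does not overlay them: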
bins=seq(0,0.05,by=0.001)
HistOfiles <- lapply(distOfiles, function(z) hist(z, breaks=bins, main="Histogram of D. coccus Mexico-types TN93 distances", ylim=c(0,1500), xlab="TN93 distance", ylab="frequency", col=rgb(0,0,0,0.01), border=rgb(0,0,0,0.01)))
I know this can be done by hand as follows:
bins=seq(0,0.05,by=0.001)
readfile1 <- read.dna("/Volumes/ALEX_R-HD/xii-27_D_coccus/xii-27-D_coccus1", format="fasta")
distance_TN931 <- dist.dna(readfile1, model="TN93")
bins=seq(0,0.05,by=0.001)
hist(distance_TN931, breaks=bins, main="Histogram of D. coccus Mexico-types TN93 distances", ylim=c(0,1500), xlab="TN93 distance", ylab="frequency", col=rgb(0,0,0,0.01), border=rgb(0,0,0,0.01))
lines(density(distance_TN931), col=rgb(1,0,0,0.01))
par(new=TRUE)
readfile2 <- read.dna("/Volumes/ALEX_R-HD/xii-27_D_coccus/xii-27-D_coccus2", format="fasta")
distance_TN932 <- dist.dna(readfile2, model="TN93")
bins=seq(0,0.05,by=0.001)
hist(distance_TN932, breaks=bins, ylim=c(0,1500), main="", xlab="", ylab="", col=rgb(0,0,0,0.01), border=rgb(0,0,0,0.01))
lines(density(distance_TN932), col=rgb(1,0,0,0.01))
par(new=TRUE)
....... on last file
But I think this would be a lot of work. It is manageable for 100 files, but for someone with 1,000 files (for example, people working with GenBank data) it could be too much.
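The repeated block above can probably be collapsed into a loop. The following is a minimal sketch rather than a tested solution for the real data: it assumes the distances are available as a list of numeric vectors or `dist` objects (like `distOfiles`), and it uses simulated values so that it runs without the fasta files. The helper name `overlay_hists` and the simulated data are hypothetical. It relies on `hist(..., add=TRUE)`, which draws onto the existing plot instead of starting a new one.

```r
# Hypothetical helper: overlay one semi-transparent histogram per replicate.
# dist_list: list of numeric vectors or dist objects; bins: common break points.
overlay_hists <- function(dist_list, bins, ylim = c(0, 1500),
                          fill = rgb(0, 0, 0, 0.01), ...) {
  for (i in seq_along(dist_list)) {
    # as.vector() drops the dist class so hist() sees plain numbers
    hist(as.vector(dist_list[[i]]), breaks = bins, ylim = ylim,
         col = fill, border = fill,
         add = (i > 1),  # the first call sets up the axes, the rest draw on top
         ...)
  }
  invisible(length(dist_list))
}

# Simulated stand-in for distOfiles (assumption: all values lie inside the bins)
set.seed(1)
fake_dists <- replicate(100, runif(66, min = 0, max = 0.04), simplify = FALSE)
bins <- seq(0, 0.05, by = 0.001)
n_drawn <- overlay_hists(fake_dists, bins, ylim = c(0, 20),
                         main = "Overlaid histograms (simulated data)",
                         xlab = "TN93 distance", ylab = "frequency")
```

With the real data, the call would presumably be `overlay_hists(distOfiles, bins)`. Because every replicate shares the same breaks and axes, `par(new=TRUE)` is not needed.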
I also tried another approach, using a bit of Unix to paste the different files into tab-delimited columns:
paste /Volumes/ALEX_R-HD/xii-27_D_coccus/xii-27-D_coccus* /Volumes/ALEX_R-HD/xii-27_D_coccus/blank > /Volumes/ALEX_R-HD/xii-27_D_coccus/blank
The resulting file looks like this; the "\t" markers are written out only to make clear how the columns are separated:
>name1 "\t" >name1 "\t" >name1 ...... 100 times for each row
actgactg "\t" actgaca "\t" actgaca
actgttgc "\t" actgact "\t" actgaca
>name2 "\t" >name2 "\t" >name2
actgactg "\t" actgaca "\t" actgaca
actgttgc "\t" actgact "\t" actgaca
But I could not work out how to get read.dna to read each column as a separate data matrix. I can read the file in, but I am stuck there.
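For the pasted layout, one possible route is to split the file back into its columns in base R before handing anything to ape. The sketch below is only an illustration under assumptions: the helper `split_pasted_fasta` is hypothetical, the input is assumed to be strictly tab-delimited with header lines starting with `>`, and the result is a list of named character vectors (one per column), not `DNAbin` objects. Converting them for use with `dist.dna` is not shown.

```r
# Hypothetical helper: split a tab-pasted FASTA file back into one
# set of sequences per column (i.e. per original file).
split_pasted_fasta <- function(path) {
  rows <- strsplit(readLines(path), "\t", fixed = TRUE)
  n_cols <- max(lengths(rows))
  lapply(seq_len(n_cols), function(j) {
    # j-th field of every row; rows that are too short give NA and are dropped
    field <- vapply(rows, function(r) if (length(r) >= j) r[j] else NA_character_,
                    character(1))
    field <- field[!is.na(field) & nzchar(field)]
    is_header <- startsWith(field, ">")
    # each header opens a new record; sequence lines inherit its index
    record <- cumsum(is_header)
    seq_lines <- split(field[!is_header],
                       factor(record[!is_header], levels = seq_len(sum(is_header))))
    seqs <- vapply(seq_lines, paste, character(1), collapse = "")
    names(seqs) <- sub("^>", "", field[is_header])
    seqs
  })
}

# Small made-up example mirroring the layout shown above
tmp <- tempfile(fileext = ".txt")
writeLines(c(">name1\t>name1",
             "actgactg\tactgaca",
             "actgttgc\tactgact",
             ">name2\t>name2",
             "actgactg\tactgaca",
             "actgttgc\tactgact"), tmp)
per_file <- split_pasted_fasta(tmp)
```

Since `lapply` over the file names already reads each file on its own, pasting the files together may be unnecessary in the first place.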
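Since I am a new R user, I am completely stumped by this. I have searched extensively online for a solution, and nothing I found does what is described above. Perhaps some variation using lattice could do the job.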
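As a base-graphics alternative to lattice, the bin counts can be computed without plotting and then drawn together in one call. This is a rough sketch under assumptions: the data are simulated stand-ins for `distOfiles`, and the names `fake_dists`, `hist_list` and `count_mat` are made up for illustration. It uses `hist(..., plot=FALSE)` to get the counts and `matplot` to draw one faint step line per replicate.

```r
# Simulated stand-in for distOfiles (assumption, not the real data)
set.seed(42)
fake_dists <- replicate(100, runif(66, min = 0, max = 0.04), simplify = FALSE)
bins <- seq(0, 0.05, by = 0.001)

# plot = FALSE returns the histogram object without drawing anything
hist_list <- lapply(fake_dists,
                    function(z) hist(as.vector(z), breaks = bins, plot = FALSE))

# One column of bin counts per replicate
count_mat <- sapply(hist_list, function(h) h$counts)

# Draw all replicates as faint step lines against the bin midpoints
matplot(hist_list[[1]]$mids, count_mat, type = "s", lty = 1,
        col = rgb(0, 0, 0, 0.1),
        xlab = "TN93 distance", ylab = "frequency")
```

Keeping the counts in a matrix also makes it straightforward to summarise the replicates per bin, for example with `apply(count_mat, 1, range)`.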