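I have an R script that reads the lines of a .sam file produced by mapping. I want to parse those lines into strings so that they are easier to manipulate, and then create the wig files I need or compute cov3 and cov5. Can you help me make this script run faster? How can I parse the lines of a huge .sam file into a data frame more quickly? Here is my script: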
gc()
rm(list=ls())
exptPath <- "/home/dimitris/INDEX3PerfectUnique31cov5.sam"
lines <- readLines(exptPath)
pos = lines
pos
chrom = lines
chrom
pos = ""
chrom = ""
nn = length(lines)
nn
# parse lines of the sam file into strings (this part is very, very slow)
rr = strsplit(lines,"\t", fixed = TRUE)
rr
trr = do.call(rbind.data.frame, rr)
pos = as.numeric(as.character(trr[8:nn,4]))
# for cov3
#pos = pos+25
#pos
chrom = trr[8:nn,3]
pos = as.numeric(pos)
pos
tab1 = table(chrom,pos, exclude="")
tab1
ftab1 = as.data.frame(tab1)
ftab1 = subset(ftab1, ftab1[3] != 0)
ftab1 = subset(ftab1, ftab1[1] != "<NA>")
oftab1 = ftab1[ order(ftab1[,1]), ]
final.ftab1 = oftab1[,2:3]
write.table(final.ftab1, "ind3_cov5_wig.txt", row.names=FALSE,
sep=" ", quote=FALSE)
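Answer 0 (score: 1)
Without access to sample input and output (e.g., a subset of the data on Dropbox), it is hard to give a detailed answer. A Bioconductor solution would first convert the sam file to bam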
library(Rsamtools)
## asBam() appends ".bam" to the destination and returns the path of the new file
bam <- asBam("/path/to/old.sam", "/path/to/new")
and then read the data directly (see ?scanBam and ?ScanBamParam to import only the fields / regions of interest)
rr <- scanBam(bam)
or, ultimately more convenient,
library(GenomicAlignments)
aln <- readGAlignments(bam)
## maybe cvg <- coverage(bam) ?
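As a rough sketch of the selective import that ?ScanBamParam describes, the following reads only the reference name and leftmost position of each mapped read, which are the two columns the script in the question extracts. The example BAM file shipped with Rsamtools stands in for the real data, and the names param, res and df are illustrative only.

```r
library(Rsamtools)

# Stand-in input (assumption): the small example BAM shipped with Rsamtools.
# Replace with the path returned by asBam() for real data.
bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")

# Ask for two fields only, and skip unmapped reads.
param <- ScanBamParam(
  what = c("rname", "pos"),
  flag = scanBamFlag(isUnmappedQuery = FALSE)
)

# Without a 'which' range, scanBam() returns a list with a single element.
res <- scanBam(bam, param = param)[[1]]

# rname is a factor of reference names, pos an integer vector.
df <- data.frame(chrom = res$rname, pos = res$pos)
head(df)
```

Importing only the needed fields avoids splitting every tab-separated column of every line, which is where the original script spends most of its time.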
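There would then be several steps to manipulate the data, ending with a GRanges object (a bit like a data.frame, but where the rows have genomic coordinates) or a related object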
## ...???
## gr <- GRanges(seqnames, IRanges(start, end), strand=..., score=...)
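One possible way to fill in those steps, assuming that cov5 means the number of reads whose 5' end falls on each position, is sketched below. That interpretation is a guess based on the question. Note also that resize() is strand-aware, whereas the original script uses the leftmost position regardless of strand. The example BAM file from Rsamtools again stands in for the real data, and ends5 is an illustrative name.

```r
library(GenomicAlignments)
library(GenomicRanges)

# Stand-in input (assumption): example BAM shipped with Rsamtools.
bam <- system.file("extdata", "ex1.bam", package = "Rsamtools")
aln <- readGAlignments(bam)

# Shrink each alignment to its 5'-most base (strand-aware).
# Using fix = "end" instead would give the 3' ends (the cov3 case).
ends5 <- resize(granges(aln), width = 1, fix = "start")

# Per-base count of read 5' ends, one Rle per chromosome.
cvg <- coverage(ends5)

# Runs of equal counts become ranges with a 'score' column; drop zero runs.
gr <- as(cvg, "GRanges")
gr <- gr[gr$score > 0]
gr
```

Adjacent positions with the same count are merged into a single range here, so some rows of gr may be wider than one base.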
with the ultimate goal of exporting to a wig / bigWig / bed file using
library(rtracklayer)
export(gr, "/path/to.wig")
There are extensive help resources, including package vignettes, man pages, and the Bioconductor mailing list.