Question

How can I read a fasta file (~4 Gb) and calculate nucleotide frequencies in windows of 4 bp in length?

Reading the fasta file takes too long using

library(ShortRead)
readFasta('myfile.fa')

I have tried to index it (and there are many of them) using

library(Rsamtools)
indexFa('myfile.fa')
fa = FaFile('myfile.fa')

However, I do not know how to access the file in this format.

Answer 1

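I would guess that "slow" for reading a file like that means about a minute; if it takes longer than that, something other than the software is the problem. It may be worth asking where the file comes from, which operating system you are on, and whether the file was manipulated before processing (e.g., by trying to open it in a text editor).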

If "too slow" is because you are running out of memory, then reading the data in chunks might help. With Rsamtools

library(Rsamtools)
fa = "my.fasta"
## indexFa(fa) if the index does not already exist
idx = scanFaIndex(fa)

create chunks of the index, e.g., n = 10 chunks

chunks = snow::splitIndices(length(idx), 10)

and then process the file

res = lapply(chunks, function(chunk, fa, idx) {
    dna = scanFa(fa, idx[chunk])
    ## ...
}, fa, idx)

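Concatenate the final result using do.call(c, res) or similar, or perhaps use a for loop if you are accumulating a single value. Indexing the fasta file is done via a call to the samtools library; on non-Windows systems, using samtools on the command line is also an option.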
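As a rough sketch of how the `## ...` placeholder might be filled in for the 4 bp question, the chunked loop can be wrapped in a function that counts tetranucleotides per chunk and sums the counts. The function name `count_tetramers_indexed` and the `n_chunks` argument are made up for illustration; `Biostrings::oligonucleotideFrequency()` counts overlapping windows by default (its `step` argument can be set to 4 for non-overlapping windows), and `parallel::splitIndices()` is used here in place of `snow::splitIndices()` only because the parallel package ships with R. This assumes Rsamtools and Biostrings are installed and that the file is already indexed.

```r
## Sketch only: count_tetramers_indexed() is a hypothetical helper name.
## Assumes Rsamtools and Biostrings are installed and that `fa` has an index
## (run Rsamtools::indexFa(fa) first if it does not).
count_tetramers_indexed <- function(fa, n_chunks = 10L) {
    idx <- Rsamtools::scanFaIndex(fa)          # one range per fasta record
    n_chunks <- min(n_chunks, length(idx))     # avoid empty chunks
    chunks <- parallel::splitIndices(length(idx), n_chunks)
    res <- lapply(chunks, function(chunk) {
        dna <- Rsamtools::scanFa(fa, idx[chunk])
        ## one row per sequence, one column per 4-mer
        freq <- Biostrings::oligonucleotideFrequency(dna, width = 4L)
        colSums(freq)                          # total counts for this chunk
    })
    Reduce(`+`, res)                           # sum counts across chunks
}
```

Calling `count_tetramers_indexed("my.fasta")` would then return one named vector of counts, so only one chunk of sequence data needs to be in memory at a time.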

An alternative is to use Biostrings::fasta.index() to index the file, then chunk through it with that index

library(Biostrings)
idx = fasta.index(fa, seqtype="DNA")
## split the rows of the index (one row per record) into 10 chunks
chunks = snow::splitIndices(nrow(idx), 10)
res = lapply(chunks, function(chunk, idx) {
    dna = readDNAStringSet(idx[chunk, ])
    ## ...
}, idx)

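If each record consists of a header line followed by a single line of DNA sequence, then it is relatively easy to read the records into R in chunks (with an even number of lines per chunk) via readLines(), and process them from there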

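con = file(fa)
open(con)
chunkSize = 10000000  # even, so header/sequence pairs are never split
while (TRUE) {
    lines = readLines(con, chunkSize)
    if (length(lines) == 0)
        break
    ## keep every second line, i.e. the sequence lines
    dna = DNAStringSet(lines[c(FALSE, TRUE)])
    ## ...
}
close(con)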
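To make the chunked readLines() idea concrete, here is a small base-R sketch that runs on a toy file and counts overlapping 4-mers without any Bioconductor packages. The names `count_tetramers`, `toy_fa` and `chunk_size`, as well as the toy sequences, are invented for illustration; for a real 4 Gb file, a vectorised counter such as Biostrings::oligonucleotideFrequency() would likely be much faster than this per-sequence substring() approach.

```r
## All names here (count_tetramers, toy_fa, chunk_size) are illustrative.
bases <- c("A", "C", "G", "T")
## fixed list of all 256 possible 4-mers, so counts from chunks line up
all_kmers <- do.call(paste0,
                     expand.grid(rep(list(bases), 4), stringsAsFactors = FALSE))

## count overlapping 4-mers in a character vector of sequences;
## windows containing anything other than A/C/G/T are dropped
count_tetramers <- function(seqs) {
    kmers <- unlist(lapply(seqs, function(s) {
        n <- nchar(s)
        if (n < 4L) return(character(0))
        starts <- seq_len(n - 3L)
        substring(s, starts, starts + 3L)
    }))
    tab <- table(factor(kmers, levels = all_kmers))
    setNames(as.integer(tab), names(tab))
}

## toy fasta file with one sequence line per record
toy_fa <- tempfile(fileext = ".fa")
writeLines(c(">seq1", "ACGTACGT", ">seq2", "AAAAA", ">seq3", "ACGTN"), toy_fa)

con <- file(toy_fa, open = "r")
chunk_size <- 4L   # must be even so header/sequence pairs stay together
total <- setNames(integer(length(all_kmers)), all_kmers)
while (length(chunk <- readLines(con, chunk_size)) > 0) {
    seqs <- chunk[c(FALSE, TRUE)]          # sequence lines only
    total <- total + count_tetramers(seqs) # accumulate across chunks
}
close(con)

total[total > 0]   # the 4-mers that occur at least once
```

Because `total` always has the same 256 entries in the same order, the per-chunk counts can be added directly rather than merged by name.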

Answer 2

Load the Biostrings package and then use the readDNAStringSet() method.

From example("readDNAStringSet"), slightly modified:

library(Biostrings)
# example("readDNAStringSet") #optional
filepath1 <- system.file("extdata", "someORF.fa", package="Biostrings")
head(fasta.seqlengths(filepath1, seqtype="DNA"))
x1 <- readDNAStringSet(filepath1)
head(x1)

Efficiently read a fasta file and calculate nucleotide frequencies in R
