Question

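Although I know that many solutions are usually offered for this kind of problem, I am still not satisfied with the runtime they need in my particular case.

Consider a 35 GB text file in FASTA format, like this: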

>Protein_1 So nice and cute little fella
MTTKKCLQKFHLESLGKLGDSFLKYAISIQLFKSYENHYEGLPSIKKNKIISNAALFKLG 
YARKILRFIRNEPFDLKVGLIPSDNSQAYNFGKEFLMPSVKMCSRVK*
>Protein_2 Fancy incredible description of its function
MADDSKFCFFLVSTFLLLAVVVNVTLAANYVPGDDILLNCGGPDNLPDADGRKWGTDIGS
[…] etc.

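I need to extract only the lines that start with >.

Doing this with grep '>' proteins.fasta > protein_descriptions.txt takes only a few minutes.

With Java 7, however, the following has now been running for more than 90 minutes: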

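import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) throws Exception {
    BufferedReader fastaIn = new BufferedReader(new FileReader(args[0]));
    List<String> l = new ArrayList<String>();
    String str;
    while ((str = fastaIn.readLine()) != null) {
        // Keep only the header (description) lines.
        if (str.startsWith(">")) {
            l.add(str);
        }
    }
    fastaIn.close();
    // …
}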

Does anyone know how to speed this up to grep-like performance?

Many thanks for your help. Cheers!

Answer 1

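If you write each matching line to the output file immediately instead of accumulating objects in memory, performance will improve (and it is closer to what you are doing with grep).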

...
BufferedWriter fastaOut = new BufferedWriter(new FileWriter(args[1]));
...
while ((str = fastaIn.readLine()) != null) {
    if (str.startsWith(">")) {
        fastaOut.write(str);
        fastaOut.newLine();
    }
}
...
fastaOut.close();
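
For reference, a complete version of this streaming approach could look like the sketch below. The class name HeaderFilter and the method copyHeaders are made up for illustration, and ISO-8859-1 is an assumption (FASTA files are normally plain ASCII, so a single-byte charset should decode them without surprises). It uses only Java 7 APIs.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical class name, for illustration only.
public class HeaderFilter {

    // Copies every line starting with '>' from inPath to outPath
    // and returns the number of lines copied.
    static long copyHeaders(String inPath, String outPath) throws IOException {
        long count = 0;
        // Assumption: the file is ASCII-compatible, so a single-byte charset is used.
        try (BufferedReader in = Files.newBufferedReader(
                     Paths.get(inPath), StandardCharsets.ISO_8859_1);
             BufferedWriter out = Files.newBufferedWriter(
                     Paths.get(outPath), StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {
                    out.write(line);   // write immediately, nothing is kept in memory
                    out.newLine();
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(copyHeaders(args[0], args[1]) + " header lines written");
    }
}
```

It would be run as java HeaderFilter proteins.fasta protein_descriptions.txt. The try-with-resources block closes both files even if an exception is thrown.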

Answer 2

biojava.org provides a FASTA reader. To read large files, you should consider using a SeekableByteChannel together with ByteBuffers. The BioJava library uses ByteBuffers.
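
To illustrate the channel-and-buffer idea without BioJava, here is a minimal sketch that scans raw bytes through a FileChannel (which implements SeekableByteChannel) and copies header lines without ever creating a String per line. This is not BioJava's reader; the class name ByteHeaderScan, the method scan, and the buffer sizes are assumptions chosen for the example, and whether it actually beats BufferedReader would have to be measured on your machine.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical class name, for illustration only.
public class ByteHeaderScan {

    // Copies every line whose first byte is '>' from inPath to outPath,
    // keeping the original line endings, and returns the number of such lines.
    static long scan(String inPath, String outPath) throws IOException {
        long count = 0;
        try (FileChannel ch = FileChannel.open(Paths.get(inPath), StandardOpenOption.READ);
             OutputStream out = new BufferedOutputStream(new FileOutputStream(outPath), 1 << 16)) {
            ByteBuffer buf = ByteBuffer.allocate(1 << 20); // 1 MiB, an arbitrary choice
            boolean atLineStart = true;  // the next byte is the first byte of a line
            boolean inHeader = false;    // we are currently inside a '>' line
            // Both flags live outside the loop, so lines that straddle two
            // buffer fills are still handled correctly.
            while (ch.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    if (atLineStart) {
                        inHeader = (b == '>');
                        if (inHeader) {
                            count++;
                        }
                    }
                    if (inHeader) {
                        out.write(b);    // includes the terminating newline, if any
                    }
                    atLineStart = (b == '\n');
                }
                buf.clear();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(scan(args[0], args[1]) + " header lines written");
    }
}
```

Because bytes are copied as-is, no charset decoding takes place, and a final header line without a trailing newline is written without one.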

Answer 3

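Using multiple threads may increase the speed considerably. If the file is X bytes long and you have n threads, start each thread at an offset that is a multiple of X / n and have it read X / n bytes. You need to synchronize the ArrayList to make sure the results are added correctly.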
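
A rough sketch of this chunked approach follows. It deviates from the description in one respect: instead of a shared, synchronized ArrayList, each task returns its own list and the lists are concatenated in chunk order, which also preserves the file order of the headers. Each task peeks one byte before its chunk to learn whether the chunk begins at a line start, and it finishes a header line that runs past the chunk end. The names ParallelHeaderScan, scanChunk and extractHeaders are made up for the example, and ISO-8859-1 is assumed. Whether this is faster depends on whether the disk or the CPU is the bottleneck.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical class name, for illustration only.
public class ParallelHeaderScan {

    // Returns the header lines whose '>' byte lies in [start, end).
    static List<String> scanChunk(FileChannel ch, long start, long end) throws IOException {
        List<String> headers = new ArrayList<String>();
        ByteBuffer buf = ByteBuffer.allocate(1 << 16);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        // Begin one byte early to find out whether 'start' is the first byte of a line.
        long pos = (start == 0) ? 0 : start - 1;
        boolean atLineStart = (start == 0);
        boolean inHeader = false;
        while (true) {
            buf.clear();
            int n = ch.read(buf, pos);   // positional read, safe to share the channel
            if (n <= 0) {
                break;                   // end of file
            }
            for (int i = 0; i < n; i++) {
                byte b = buf.get(i);
                long at = pos + i;
                if (inHeader) {
                    if (b == '\n') {     // header complete, even if it ran past 'end'
                        headers.add(line.toString("ISO-8859-1"));
                        line.reset();
                        inHeader = false;
                    } else if (b != '\r') {
                        line.write(b);   // carriage returns are dropped
                    }
                } else if (at >= end) {
                    return headers;      // past the chunk and not inside a header
                } else if (atLineStart && b == '>') {
                    inHeader = true;
                    line.write(b);
                }
                atLineStart = (b == '\n');
            }
            pos += n;
        }
        if (inHeader) {
            headers.add(line.toString("ISO-8859-1")); // last line had no newline
        }
        return headers;
    }

    // Splits the file into 'threads' chunks and scans them concurrently.
    static List<String> extractHeaders(String path, int threads) throws Exception {
        final FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long size = ch.size();
            long chunk = (size + threads - 1) / threads;
            List<Future<List<String>>> parts = new ArrayList<Future<List<String>>>();
            for (int i = 0; i < threads; i++) {
                final long start = Math.min(size, i * chunk);
                final long end = Math.min(size, start + chunk);
                parts.add(pool.submit(new Callable<List<String>>() {
                    public List<String> call() throws IOException {
                        return scanChunk(ch, start, end);
                    }
                }));
            }
            List<String> all = new ArrayList<String>();
            for (Future<List<String>> part : parts) {
                all.addAll(part.get()); // chunk order equals file order
            }
            return all;
        } finally {
            pool.shutdown();
            ch.close();
        }
    }
}
```

Note that this sketch, like the code in the question, keeps all header lines in memory; for a very large file each task could write to its own temporary file instead.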

Java: How to quickly extract matching lines from a large text file?
