Question

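I have a huge pipeline written in Python that works with very large .gz files (about 14 GB compressed), and I need a better way to send certain lines to external software (formatdb from blast-legacy/2.2.26). I have a Perl script that someone wrote for me a long time ago that does this very fast, but I need to do the same thing in Python, because the rest of the pipeline is written in Python and I have to keep it that way.

The Perl script uses two filehandles: one holds the zcat of the .gz file, and the other collects the lines the software needs (2 of every 4) and feeds them to it as input. It involves bioinformatics, but no experience is needed. The file is in fastq format and the software needs fasta format. Every 4 lines is one fastq record; take the 1st and 2nd lines, put a '>' at the start of the 1st line (replacing its leading '@'), and that is the fasta equivalent that formatdb will use for each record.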

The Perl script is as follows:

#!/usr/bin/perl 
my $SRG = $ARGV[0]; # reads.fastq.gz

open($fh, sprintf("zcat %s |", $SRG)) or die "Broken gunzip $!\n";

# -i: input -n: db name -p: program 
open ($fh2, "| formatdb -i stdin -n $SRG -p F") or die "no piping formatdb!, $!\n";

#Fastq => Fasta sub
my $localcounter = 0;
while (my $line = <$fh>){
        if ($. % 4==1){
                print $fh2 "\>" . substr($line, 1);
                $localcounter++;
        }
        elsif ($localcounter == 1){
                print $fh2 "$line";
                $localcounter = 0;
        }
        else{
        }
}
close $fh;
close $fh2;
exit;

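It works great. How can I do the same thing in Python? I like how Perl can use these filehandles, but I'm not sure how to do that in Python without creating an actual file. All I can think of is to gzip.open the file, write the two lines I need from each record to a new file, and run "formatdb" on that, but it is too slow. Any ideas? I need to work it into the Python pipeline, so I can't just rely on the Perl script, and I would also like to know how to do this in general. I assume I will need to use some form of the subprocess module.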

Here is my Python code, but again it is too slow, and speed is the issue here (huge files):

#!/usr/bin/env python

import gzip
from Bio import SeqIO # can recognize fasta/fastq records
import subprocess as sp
import os,sys

filename = sys.argv[1] # reads.fastq.gz

tempFile = filename + ".temp.fasta"

outFile = open(tempFile, "w")

handle = gzip.open(filename, "r")
# parses each fastq record
# r.id and r.seq come from the 1st and 2nd lines of each record
for r in SeqIO.parse(handle, "fastq"):
    outFile.write(">" + str(r.id) + "\n")
    outFile.write(str(r.seq) + "\n")

handle.close()
outFile.close()

cmd = 'formatdb -i ' + str(tempFile) + ' -n ' + filename + ' -p F '
sp.call(cmd, shell=True)

cmd = 'rm ' + tempFile
sp.call(cmd, shell=True)

Answer 1

First, there is a much better solution in both Perl and Python: just use a gzip library. In Python, there is one in the stdlib; in Perl, you can find one on CPAN. For example:

import gzip

with gzip.open(path, 'rt', encoding='utf-8') as f:  # 'rt': encoding is only valid in text mode
    for line in f:
        do_stuff(line)

That is simpler, more efficient, and more portable than shelling out to zcat.
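As a rough sketch of how the gzip approach could cover the FASTQ-to-FASTA step from the question: the helper below (the name `fasta_lines_from_fastq_gz` is made up for illustration) reads the archive in text mode, takes four lines at a time, and yields only the header and sequence lines. It assumes well-formed four-line records.

```python
import gzip
import itertools


def fasta_lines_from_fastq_gz(path):
    """Lazily yield FASTA lines (header, then sequence) for each FASTQ record."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        while True:
            # Pull exactly one 4-line FASTQ record without loading the file.
            record = list(itertools.islice(handle, 4))
            if len(record) < 4:
                break  # end of file (or an incomplete trailing record)
            header, sequence = record[0], record[1]
            yield ">" + header[1:]  # '@id ...' becomes '>id ...'
            yield sequence          # '+' and quality lines are dropped
```

Each yielded line can then be written to whatever consumes the FASTA, for example the stdin of a subprocess.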

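However, if you really do want to launch a subprocess and control its pipes in Python, you can do that with the subprocess module. And, unlike Perl, Python can do this without having to stick a shell in the middle. There is even a nice section in the docs, "Replacing Older Functions with the subprocess Module", that gives you recipes.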

So:

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)

Now zcat.stdout is a file-like object, with the usual read methods and so on, wrapping the pipe to the zcat subprocess.
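Because the pipe behaves like any other file object, it can also be iterated line by line. A minimal sketch, assuming Python 3.7+ (for the `text=True` argument); the helper name `iter_stdout_lines` is hypothetical:

```python
import subprocess


def iter_stdout_lines(cmd):
    """Run cmd (a list of arguments) and yield its stdout one text line at a time."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line
    # Leaving the with-block closes the pipe and waits for the child to exit.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

For the case in the question, this would be called as `iter_stdout_lines(['zcat', path])`.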

So, for example, to read a binary file 8K at a time in Python 3.x:

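import functools
import subprocess

zcat = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE)
# iter(callable, sentinel) keeps calling read(8192) until it returns b'' (EOF)
for chunk in iter(functools.partial(zcat.stdout.read, 8192), b''):
    do_stuff(chunk)
zcat.wait()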

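(If you want to do this in Python 2.x, or read a text file one line at a time instead of a binary file 8K at a time, or anything else, the changes are the same as they would be for any other file-handling code.)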
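Putting the pieces together, one way to approximate the Perl script's two filehandles is a second `Popen` whose stdin is a pipe. This is only a sketch: the function names are invented for illustration, the data is handled as bytes, and well-formed four-line FASTQ records are assumed.

```python
import subprocess


def fastq_to_fasta(lines):
    """Yield FASTA lines from an iterable of FASTQ byte lines."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield b">" + line[1:]  # header: swap the leading '@' for '>'
        elif i % 4 == 1:
            yield line             # sequence line, passed through unchanged
        # the '+' separator and quality lines are skipped


def pipe_fastq_to_command(reader_cmd, writer_cmd):
    """Stream reader_cmd's stdout, converted to FASTA, into writer_cmd's stdin."""
    reader = subprocess.Popen(reader_cmd, stdout=subprocess.PIPE)
    writer = subprocess.Popen(writer_cmd, stdin=subprocess.PIPE)
    try:
        for out_line in fastq_to_fasta(reader.stdout):
            writer.stdin.write(out_line)
    finally:
        writer.stdin.close()   # closing stdin signals EOF to the consumer
        reader.stdout.close()
    return reader.wait(), writer.wait()
```

With the flags from the Perl script in the question, the call would look like `pipe_fastq_to_command(['zcat', path], ['formatdb', '-i', 'stdin', '-n', path, '-p', 'F'])`. No temporary file is created, and neither command goes through a shell.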

Answer 2

You can use this function to parse the whole file and load it as a list of lines:

import gzip

def convert_gz_to_list_of_lines(filepath):
    """Parse gz file and convert it into a list of lines."""
    file_as_list = []
    with gzip.open(filepath, 'rt', encoding='utf-8') as f:
        try:
            for line in f:
                file_as_list.append(line)
        except EOFError:
            # Truncated archive: keep the lines read so far.
            pass
    return file_as_list

Python equivalent of piping zcat result to filehandle in Perl
