Question

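There are similar questions (and answers), but never quite all together, and I can't seem to get anything to work. Since I'm just starting out with Python, something easy to understand would be great!

I have 3 large data files (>500G) that I need to unzip, concatenate, pipe to a subprocess, and then pipe that output to another subprocess. I then need to process the final output, which I would like to do in Python. Note that I don't need the unzipped and/or concatenated file except for the processing; creating one would, I think, be a waste of space. Here is what I have so far...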

import gzip
from subprocess import Popen, PIPE

#zipped files
zipfile1 = "./file_1.txt.gz"   
zipfile2 = "./file_2.txt.gz"  
zipfile3 = "./file_3.txt.gz"


# Open the first pipe
p1 = Popen(["dataclean.pl"], stdin=PIPE, stdout=PIPE)

# Unzip the files and pipe them in (has to be a more pythonic way to do it - 
# if this is even correct)
unzipfile1 = gzip.open(zipfile1, 'wb')
p1.stdin.write(unzipfile1.read())
unzipfile1.close()

unzipfile2 = gzip.open(zipfile2, 'wb')
p1.stdin.write(unzipfile2.read())
unzipfile2.close()

unzipfile3 = gzip.open(zipfile3, 'wb')
p1.stdin.write(unzipfile3.read())
unzipfile3.close()


# Pipe the output of p1 to p2
p2 = Popen(["dataprocess.pl"], stdin=p1.stdout, stdout=PIPE)

# Not sure what this does - something about a SIGPIPE
p1.stdout.close()

## Not sure what this does either - but it is in the pydoc
output = p2.communicate()[0]

## more processing of p2.stdout...
print p2.stdout

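Any advice would be greatly appreciated. *As a bonus question... the pydoc for read() says this:

"Also note that when in non-blocking mode, less data than what was requested may be returned, even if no size parameter was given."

That seems scary. Can anyone interpret it? I don't want to read in only part of a dataset thinking it is the whole thing. I thought leaving out the size was a good thing, especially when I don't know the size of the file.

Thanks,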

GK

Answer 1

First things first: I think your mode is incorrect:

unzipfile1 = gzip.open(zipfile1, 'wb')

This should open zipfile1 for writing, not reading. I hope your data still exists.

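Second, you don't want to work with the entire data set in one go. You should work with the data in blocks of 16k or 32k or larger. (The optimum size will vary based on many factors; if this task has to be done many times, make the size configurable so you can time different values.)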

What you're looking for is probably more like this untested pseudo-code:

for block in iter(lambda: unzipfile1.read(4096*4), b""):
    p1.stdin.write(block)
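
As a minimal sketch of the chunked approach, assuming gzip inputs like the three files in the question, the helper below decompresses each file in turn and writes it to a binary stream one block at a time. The name stream_gzip_files and the 64 KiB block size are illustrative choices, not part of the original script.

```python
import gzip
import shutil

BLOCK_SIZE = 64 * 1024  # assumed block size; time a few values for your data


def stream_gzip_files(paths, sink, block_size=BLOCK_SIZE):
    """Decompress each gzip file in order and write it to sink in blocks.

    Only one block is held in memory at a time, so the size of the
    input files does not matter.
    """
    for path in paths:
        # 'rb' opens the archive for reading; 'wb' would truncate it.
        with gzip.open(path, "rb") as src:
            shutil.copyfileobj(src, sink, block_size)
```

With the question's variables, this would be called as stream_gzip_files([zipfile1, zipfile2, zipfile3], p1.stdin), followed by p1.stdin.close() so that p1 sees end of input.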

If you're trying to hook together multiple processes in a pipeline in Python, then it'll probably look more like this:

for block in iter(lambda: unzipfile1.read(4096*4), b""):
    p1.stdin.write(block)
    p2.stdin.write(p1.stdout.read())

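This gives the output from p1 to p2 as soon as possible. I've assumed that p1 won't generate significantly more output than it was given as input. If the output of p1 will be ten times greater than the input, then you should add another loop similar to this one.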
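
A bare read() on p1.stdout waits for end of file, so interleaving writes and reads by hand can stall once a pipe buffer fills. One possible way around this, sketched here under the assumption that both scripts read stdin and write stdout, is to let the operating system connect p1 to p2 directly and to feed p1 from a background thread. The function name run_two_stage and its parameters are hypothetical.

```python
import shutil
import subprocess
import threading


def run_two_stage(source, cmd1, cmd2, block_size=64 * 1024):
    """Stream source through cmd1 | cmd2 and return cmd2's stdout as bytes.

    source is any readable binary file object; cmd1 and cmd2 are argument
    lists such as ["dataclean.pl"] (assumed to read stdin, write stdout).
    """
    # close_fds=True keeps p2 from inheriting the write end of p1's stdin,
    # which would otherwise stop p1 from ever seeing end of input.
    p1 = subprocess.Popen(cmd1, stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, close_fds=True)
    p2 = subprocess.Popen(cmd2, stdin=p1.stdout,
                          stdout=subprocess.PIPE, close_fds=True)
    # Drop our copy of p1's stdout so p1 gets SIGPIPE if p2 exits early.
    p1.stdout.close()

    def feed():
        try:
            shutil.copyfileobj(source, p1.stdin, block_size)
        finally:
            p1.stdin.close()  # signals end of input to p1

    feeder = threading.Thread(target=feed)
    feeder.start()
    output = p2.communicate()[0]  # drains p2 while the feeder keeps writing
    feeder.join()
    p1.wait()
    return output
```

Collecting the result with communicate() is only reasonable when the final output fits in memory; for larger results, iterating over p2.stdout line by line would be the safer choice.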

But, I have to say, this feels like a lot of extra work to replicate the shell script:

gzip -cd file1.gz file2.gz file3.gz | dataclean.pl | dataprocess.pl

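gzip(1) will automatically handle the block-sized data transfer as I've described above, and assuming your dataclean and dataprocess scripts also work on data in blocks rather than performing full reads (as your original version of this script does), they should all run in parallel to the best of their abilities.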
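
If the final output still has to be handled in Python, the same shell pipeline can be started from Python and its last stage read line by line. The following is a rough sketch, assuming gzip is on the PATH and the scripts are executable; pipeline_lines is a hypothetical helper name.

```python
import subprocess


def pipeline_lines(commands):
    """Chain commands like a shell pipeline and yield the last one's lines.

    commands is a list of argument lists, for example
    [["gzip", "-cd", "file1.gz"], ["dataclean.pl"], ["dataprocess.pl"]].
    """
    procs = []
    prev_stdout = None
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdin=prev_stdout,
                                stdout=subprocess.PIPE, close_fds=True)
        if prev_stdout is not None:
            prev_stdout.close()  # the child holds its own copy of the pipe
        prev_stdout = proc.stdout
        procs.append(proc)
    try:
        for line in prev_stdout:  # streams; never loads the full output
            yield line
    finally:
        prev_stdout.close()
        for proc in procs:
            proc.wait()
```

Each yielded line is a bytes object, so a loop such as "for line in pipeline_lines(cmds):" can process the result incrementally while gzip and both scripts keep running in parallel.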

Concatenating large files, pipes, and a bonus question
