Is there a way to make fileinput.input() with openhook=hook_compressed performant?

Time: 2014-09-23 01:39:12

Tags: python performance file-io gzip

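fileinput.input seems to be at least twice as slow as zcat, even with the buffering setting. Question: without writing a bunch of code, is there anything I can do to make it performant? The test I ran takes data from urandom,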

"""generate.py"""
import base64
with open('/dev/urandom', 'rb') as f:
    for _ in xrange(102400):
        print(base64.b64encode(f.read(1024)))

Run it and pipe the output through gzip,

> python generate.py | gzip - > test_input.gz
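The generator script above is Python 2 and relies on a shell pipe. For readers on Python 3, a rough equivalent that writes the compressed file directly might look like the sketch below. The function name `write_test_input` and its parameters are hypothetical, and `os.urandom` stands in for reading `/dev/urandom`.

```python
import base64
import gzip
import os


def write_test_input(path, n_lines=102400, chunk=1024):
    """Write n_lines of base64-encoded random data to a gzip file at path."""
    with gzip.open(path, "wb") as out:
        for _ in range(n_lines):
            # One base64 line per random chunk, newline-terminated like print().
            out.write(base64.b64encode(os.urandom(chunk)) + b"\n")
```

Calling `write_test_input("test_input.gz")` with the defaults should produce a file of the same shape as the one used in the timings here, though the random bytes and the exact compressed size will differ.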

zcat time

> time zcat test_input.gz > /dev/null
zcat test_input.gz > /dev/null  1.56s user 0.02s system 99% cpu 1.576 total

fileinput time

> time python -c 'import fileinput; list(fileinput.input(files=["test_input.gz"], openhook=fileinput.hook_compressed))'
python -c   3.13s user 0.16s system 99% cpu 3.293 total

It isn't just that fileinput.input() is slow, because reading from stdin is fine,

> time zcat test_input.gz | python -c 'import fileinput; list(fileinput.input())'
zcat test_input.gz  1.64s user 0.04s system 96% cpu 1.736 total
python -c 'import fileinput; list(fileinput.input())'  0.39s user 0.17s system 31% cpu 1.800 total

I've messed around with bufsize=, but had no luck.

Writing a bunch of code

I googled around thinking that gzip itself was slow, and found that if I do some manual buffering it is fine,

"""read_buffered_manual.py"""
import gzip

def input_buffered_manual(filename, buf_size=32 * 1024):
    fd = gzip.open(filename)
    try:
        remaining = ''
        while True:
            input_ = fd.read(buf_size)
            if not input_:
                if remaining:
                    yield remaining
                return

            lines = input_.split('\n')
            lines[0] = remaining + lines[0]
            remaining = lines.pop()
            for line in lines:
                yield line
    finally:
        fd.close()

for line in input_buffered_manual("test_input.gz"):
    print line

This is fast, in fact even faster than zcat,

> time python read_buffered_manual.py > /dev/null
python read_buffered_manual.py > /dev/null  1.40s user 0.04s system 99% cpu 1.461 total
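The manual buffering result suggests the cost lies in per-line reads on the gzip object rather than in decompression itself. A shorter way to get similar buffering, offered only as a sketch, is a custom openhook that wraps the gzip file in `io.BufferedReader`. The names `hook_buffered_gzip` and `read_lines` and the 32 KiB buffer size are assumptions for illustration. The sketch targets Python 3, yields bytes lines, and has not been timed against the test file in this question.

```python
import fileinput
import gzip
import io


def hook_buffered_gzip(filename, mode):
    """Hypothetical openhook: open a gzip file behind a read buffer.

    fileinput calls the hook with (filename, mode); the mode is ignored
    here and the file is always opened as binary, so lines are bytes.
    """
    raw = gzip.open(filename, "rb")
    # buffer_size mirrors the 32 KiB used in the manual version (an assumption).
    return io.BufferedReader(raw, buffer_size=32 * 1024)


def read_lines(filenames):
    """Yield each line (bytes, newline included) from the given gzip files."""
    with fileinput.input(files=filenames, openhook=hook_buffered_gzip) as stream:
        for line in stream:
            yield line
```

Whether this closes the gap to zcat depends on the Python version, so it is worth timing `read_lines(["test_input.gz"])` the same way as the other variants before relying on it.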

1 Answer:

Answer 0: (score: 1)

Well, you could use a specialized tool:

import gzip
import sys
import shutil

for filename in ["test_input.gz"]:
    with gzip.open(filename) as file:
        shutil.copyfileobj(file, sys.stdout)

That is fast.
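On Python 3 the snippet in this answer needs a binary destination, because `gzip.open` yields bytes by default while `sys.stdout` expects text. The following is a minimal sketch under that assumption. The function name `copy_gzip_to` is hypothetical, and like the original it copies the whole stream rather than iterating line by line.

```python
import gzip
import shutil


def copy_gzip_to(filenames, out):
    """Decompress each file in turn and copy its bytes to the binary stream out."""
    for filename in filenames:
        with gzip.open(filename, "rb") as src:
            # copyfileobj moves data in large chunks, avoiding per-line overhead.
            shutil.copyfileobj(src, out)
```

Calling `copy_gzip_to(["test_input.gz"], sys.stdout.buffer)` after `import sys` would mirror the answer's behaviour, since `sys.stdout.buffer` is the binary layer beneath the text-mode `sys.stdout`.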