Question

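I am trying to gzip files faster using Python, since my files range from as small as 30 MB to as large as 4 GB.

Is there a more efficient way to create a gzip file than the following? Is there a way to optimize it so that, if a file is small enough to fit in memory, the whole file is read in a single chunk rather than line by line?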

import gzip

with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        # writelines() iterates over f_in, i.e. it copies line by line
        f_out.writelines(f_in)

Answer 1

Use the shutil.copyfileobj() function to copy the file in larger chunks. In this example I use 16 MiB chunks, which is a reasonable size.

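import gzip
import shutil

MEG = 2**20  # number of bytes in 1 MiB
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        # Copy in 16 MiB chunks instead of line by line
        shutil.copyfileobj(f_in, f_out, length=16*MEG)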

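You may find that calling out to the external gzip program is faster for large files, especially if you plan to compress multiple files in parallel.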
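Calling out to gzip for several files in parallel could look roughly like the sketch below. It assumes a gzip executable is available on the PATH; the helper names gzip_external and gzip_many and the default worker count are illustrative and not part of the original answer. Threads are enough here because the actual compression runs in the child processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def gzip_external(path):
    """Compress `path` to `path + '.gz'` with the external gzip program."""
    out_path = path + ".gz"
    with open(out_path, "wb") as f_out:
        # `gzip -c` writes the compressed stream to stdout and keeps the input file.
        subprocess.run(["gzip", "-c", path], stdout=f_out, check=True)
    return out_path


def gzip_many(paths, workers=4):
    """Compress several files at once; each worker thread waits on one gzip process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gzip_external, paths))
```

For example, gzip_many(["a.log", "b.log"]) would return ["a.log.gz", "b.log.gz"] once both child processes have finished. Whether this beats the in-process gzip module depends on the files and the machine, so it is worth timing both.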

Answer 2

You can read the file all at once instead of line by line. For example:

import gzip
with open(j, 'rb') as f_in:
    # Reads the whole file into memory, so this only suits files that fit in RAM
    content = f_in.read()
with gzip.open(j + '.gz', 'wb') as f_out:
    f_out.write(content)
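To cover the "small enough to fit in memory" case from the question, one option is to pick the strategy from the file size. This is only a sketch: the function name gzip_file, the 256 MiB cutoff, and the 16 MiB chunk size are assumptions chosen for illustration, not values taken from the answers.

```python
import gzip
import os
import shutil

# Assumed cutoff: files up to this size are read into memory in one go.
SMALL_FILE_LIMIT = 256 * 2**20  # 256 MiB, an arbitrary illustrative value
CHUNK_SIZE = 16 * 2**20         # 16 MiB chunks for larger files


def gzip_file(path, limit=SMALL_FILE_LIMIT):
    """Write `path + '.gz'`, reading small files whole and large ones in chunks."""
    out_path = path + ".gz"
    with open(path, "rb") as f_in, gzip.open(out_path, "wb") as f_out:
        if os.path.getsize(path) <= limit:
            # Small file: a single read followed by a single write.
            f_out.write(f_in.read())
        else:
            # Large file: stream in fixed-size chunks to bound memory use.
            shutil.copyfileobj(f_in, f_out, CHUNK_SIZE)
    return out_path
```

The cutoff should be tuned to the memory actually available; with a chunk size as large as 16 MiB the chunked path may already be close to the whole-file path in speed.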

Answer 3

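Below are two almost identical methods for reading gzip files:

A.) Load everything into memory -> probably a bad choice for very large files (several GB), because you may run out of memory
B.) Do not load everything into memory, read line by line -> suitable for BIG files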

Adapted from https://codebright.wordpress.com/2011/03/25/139/ and https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/ http://pastebin.com/dcEJRs1i

import subprocess
import sys

# Pick an in-memory buffer type that matches the running Python version
if sys.version_info[0] >= 3:
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO

A.)

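def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: Generator (yields one line at a time; bytes on Python 3)
    """
    # gzcat must be on the PATH (on many Linux systems the equivalent is zcat or "gzip -dc")
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    # communicate() reads the entire decompressed output into memory
    fh = io_method(ph.communicate()[0])
    for line in fh:
        yield line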

B.)

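def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: Generator (yields one line at a time; bytes on Python 3)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    # Iterating over the pipe streams the output without buffering the whole file
    for line in ph.stdout:
        yield line
    ph.stdout.close()
    ph.wait()  # reap the child process once all lines have been consumed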
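For comparison, here is a sketch of the same line-by-line iteration using only the standard-library gzip module, which removes the dependency on an external gzcat binary. Note that this swaps in a different technique from the subprocess approach above; the function name yield_line_gz_module is hypothetical, and the relative speed of the two approaches is not measured here.

```python
import gzip


def yield_line_gz_module(fn):
    """Yield each line of the gzip file `fn` as bytes, decompressing lazily."""
    with gzip.open(fn, "rb") as fh:
        # GzipFile objects are iterable, so lines are decompressed on demand.
        for line in fh:
            yield line
```

Like method B, this keeps memory use roughly constant regardless of file size, since only the current line and the decompressor's internal buffers are held at any time.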

Faster way to gzip files with Python?
