Question

I'm trying to replicate this bash command in Python. It produces gzipped pieces of 50MB each.

split -b 50m "file.dat.gz" "file.dat.gz.part-"

My attempt at a Python equivalent:

import gzip
infile = "file.dat.gz"
slice = 50*1024*1024 # 50MB
with gzip.open(infile, 'rb') as inf:
  for i, ch in enumerate(iter(lambda: inf.read(slice), b"")):
    print(i, slice)
    with gzip.open('{}.part-{}'.format(infile[:-3], i), 'wb') as outp:
      outp.write(ch)

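This produces gzipped files of about 15MB each. When I gunzip them, they are 50MB each.

How do I split the gzipped file in Python so that each piece is 50MB before gunzipping?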

Answer 1

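I don't believe split works the way you think it does. It does not split the gzip file into smaller gzip files, so you can't call gunzip on the individual files it creates. It literally breaks the compressed bytes into smaller chunks, and if you want to gunzip the data you first have to concatenate all the chunks back together. So, to emulate the actual behavior in Python, we'd do something like: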

infile_name = "file.dat.gz"

chunk = 50*1024*1024 # 50MB

with open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, chunk)
        with open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)

Realistically, we'd read several smaller input chunks to build each output chunk, so that less memory is used.
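
A minimal sketch of that idea, assuming a hypothetical helper named split_raw and a part naming scheme of my own (neither is from the original answer). It copies the compressed bytes in small reads, so at most read_size bytes are held in memory at a time. As with split, the parts are not valid gzip files on their own.

```python
def split_raw(infile_name, part_size, read_size=1024 * 1024):
    """Cut a file into parts of at most part_size bytes, using small reads."""
    part_names = []
    n = 0
    with open(infile_name, "rb") as infile:
        while True:
            # First read of the next part; empty means the input is exhausted.
            data = infile.read(min(read_size, part_size))
            if not data:
                break
            name = f"{infile_name}.part-{n:05d}"  # hypothetical naming scheme
            with open(name, "wb") as outfile:
                remaining = part_size
                while data:
                    outfile.write(data)
                    remaining -= len(data)
                    if remaining == 0:
                        break
                    data = infile.read(min(read_size, remaining))
            part_names.append(name)
            n += 1
    return part_names
```

Concatenating the parts in order reproduces the original file byte for byte, and only then can it be gunzipped.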

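We might be able to break the file into smaller files that can each be gunzipped individually and still hit the target size. Using something like a BytesIO stream, we could gunzip the input and gzip it into that in-memory stream until it reaches the target size, then write it out and start a new BytesIO stream.

With compressed data, you have to measure the size of the output, not the size of the input, because we can't predict how well the data will compress.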
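
The approach described above could be sketched roughly as follows. The function name, the part naming, and the read_size default are my own assumptions, and the input name is assumed to end in ".gz". Note that buf.tell() only counts compressed bytes that have already been flushed into the BytesIO stream, so each part can overshoot target_size somewhat. Treat the target as approximate.

```python
import gzip
import io


def split_gzip_by_compressed_size(infile_name, target_size, read_size=64 * 1024):
    """Re-compress a gzip file into parts that can each be gunzipped alone."""
    part_names = []
    part = 0
    with gzip.open(infile_name, "rb") as infile:
        data = infile.read(read_size)  # decompressed bytes
        while data:
            buf = io.BytesIO()
            with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
                while data:
                    gz.write(data)
                    data = infile.read(read_size)
                    # Measure the compressed output, not the input.
                    if buf.tell() >= target_size:
                        break
            # Closing gz flushed the remaining compressed bytes and the trailer.
            name = f"{infile_name[:-3]}.part-{part:05d}.gz"  # hypothetical naming
            with open(name, "wb") as outfile:
                outfile.write(buf.getvalue())
            part_names.append(name)
            part += 1
    return part_names
```

Each part decompresses on its own, and the decompressed parts concatenated in order give back the original data.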

Answer 2

Here is a solution that emulates the split -l (split by lines) option, which lets you open each resulting file with gunzip on its own.

import io
import os
import shutil
from xopen import xopen

def split(infile_name, num_lines):

    infile_name_fp = os.path.basename(infile_name).split('.')[0]  # first part of the file name
    cur_dir = os.path.dirname(infile_name) or '.'  # '.' when the path has no directory part
    out_dir = f'{cur_dir}/{infile_name_fp}_split'
    if os.path.exists(out_dir):
        shutil.rmtree(out_dir)
    os.makedirs(out_dir)  # create in the same folder as the original .csv.gz file

    m = 0
    part = 0
    buf = io.StringIO()  # initialize buffer
    with xopen(infile_name, 'rt') as infile:
        for line in infile:
            if m >= num_lines:  # buffer is full: write it to a file
                with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz', mode='wt', compresslevel=6) as outfile:
                    outfile.write(buf.getvalue())
                m = 0
                part += 1
                buf = io.StringIO()  # start a fresh buffer -> faster than seek(0); truncate(0)
            buf.write(line)  # always keep the current line, so none is dropped at a part boundary
            m += 1

        # write whatever is left in the buffer to a file
        with xopen(f'{out_dir}/{infile_name_fp}.part-{str(part).zfill(5)}.csv.gz', mode='wt', compresslevel=6) as outfile:
            outfile.write(buf.getvalue())
        buf.close()

Usage:

split('path/to/myfile.csv.gz', num_lines=100000)

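This writes a folder containing the split files to path/to/myfile_split.

Discussion: I've used xopen here for extra speed, but you can use gzip.open instead if you want to stay with the Python standard library. Performance-wise, I benchmarked this at roughly twice the running time of a solution combining pigz and split. That's not bad, but it could be better. The bottleneck is the for loop and the buffer, so rewriting this to work asynchronously might make it faster.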
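
For readers who prefer the standard library only, here is a rough sketch of the same line-based split using gzip.open. The function name split_lines_gzip, the explicit out_dir parameter, and the utf-8 encoding are my assumptions, not part of the answer above. It has not been benchmarked against the xopen version.

```python
import gzip
import os
from itertools import islice


def split_lines_gzip(infile_name, num_lines, out_dir):
    """Split a gzipped text file into gzipped parts of num_lines lines each."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.basename(infile_name).split(".")[0]
    part_names = []
    part = 0
    # newline="" keeps the original line endings untouched.
    with gzip.open(infile_name, "rt", encoding="utf-8", newline="") as infile:
        while True:
            lines = list(islice(infile, num_lines))  # next batch of lines
            if not lines:
                break
            name = os.path.join(out_dir, f"{base}.part-{part:05d}.csv.gz")
            with gzip.open(name, "wt", compresslevel=6,
                           encoding="utf-8", newline="") as outfile:
                outfile.writelines(lines)
            part_names.append(name)
            part += 1
    return part_names
```

Holding one batch of lines in a list plays the role of the StringIO buffer, so memory use still grows with num_lines.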

Split equivalent for gzip files in Python
