I am trying to compress a folder (prefix) of files in S3 that totals more than 200 GB. I found smart_open to be a very promising package, but after reading its "How To" wiki I still need more guidance on how to use it.
I have a "folder" of files in S3, for example:
--S3_Bucket_1/file_folder
  -- folder1/file1.parquet
  -- folder2/file2.parquet
  -- text_file1.txt
  -- jar_file1.jar
  ...
I want to compress all the files under file_folder (>200 GB) in a streaming fashion and save the compressed result to another S3 location, i.e. s3://bucket2/123/output.gz.
I found that smart_open provides functionality very close to what I need. With the code below, I was able to compress all the files under file_folder into one single file named output.gz:
import boto3
import smart_open
from smart_open import s3

source_bucket_name = "S3_Bucket_1"          # the source bucket containing the files I want to compress
prefix = "file_folder"                      # S3 prefix for the files under a "folder"
output_path = "s3://bucket2/123/output.gz"  # where I want to save the compressed output

# smart_open gzip-compresses the stream because output_path ends in ".gz"
with smart_open.open(output_path, 'wb') as fout:
    for key, content in s3.iter_bucket(source_bucket_name, prefix=prefix):
        fout.write(content)
However, this does not do what I want: all the files get merged into one single file inside the compressed output. I want to keep the original file structure.
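To make the problem concrete: gzip is a pure stream compressor with no notion of entries, so concatenated contents come back as one undifferentiated blob. A minimal local demonstration (the byte contents are just placeholders):

```python
import gzip
import io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    # Analogous to writing each S3 object's content in turn
    gz.write(b'contents of file1')
    gz.write(b'contents of file2')

# Decompressing yields a single concatenated stream; the file
# boundaries and names are gone.
print(gzip.decompress(buf.getvalue()))  # b'contents of file1contents of file2'
```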
I also tried compressing the files with Python's ZipFile:
import tempfile
import zipfile

import smart_open
from smart_open import s3

source_bucket_name = "S3_Bucket_1"          # the source bucket containing the files I want to compress
prefix = "file_folder"                      # S3 prefix for the files under a "folder"
output_path = "s3://bucket2/123/output.gz"  # where I want to save the compressed output

with tempfile.NamedTemporaryFile() as tmp:
    tp = {'writebuffer': tmp}
    with smart_open.open(output_path, 'wb', transport_params=tp) as fout:
        with zipfile.ZipFile(fout, 'w') as zipper:
            for key, content in s3.iter_bucket(source_bucket_name, prefix=prefix):
                zipper.writestr(key, content)
This gives me the error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-2-e61f4a7d1257> in <module>
19 with zipfile.ZipFile(fout, 'w') as zipper:
20 for key, content in s3.iter_bucket(source_bucket_name, prefix = prefix):
---> 21 zipper.writestr(key, content)
~/anaconda3/lib/python3.8/zipfile.py in writestr(self, zinfo_or_arcname, data, compress_type, compresslevel)
1815 with self._lock:
1816 with self.open(zinfo, mode='w') as dest:
-> 1817 dest.write(data)
1818
1819 def __del__(self):
~/anaconda3/lib/python3.8/zipfile.py in close(self)
1178 # Preserve current position in file
1179 self._zipfile.start_dir = self._fileobj.tell()
-> 1180 self._fileobj.seek(self._zinfo.header_offset)
1181 self._fileobj.write(self._zinfo.FileHeader(self._zip64))
1182 self._fileobj.seek(self._zipfile.start_dir)
~/anaconda3/lib/python3.8/gzip.py in seek(self, offset, whence)
374 raise ValueError('Seek from end not supported')
375 if offset < self.offset:
--> 376 raise OSError('Negative seek in write mode')
377 count = offset - self.offset
378 chunk = b'\0' * 1024
OSError: Negative seek in write mode
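As far as I can tell, the seek comes from ZipFile rewriting each entry's local header after the data is written. A gzip stream opened for writing reports seekable() as True (it only supports forward seeks), so ZipFile attempts the backward seek instead of falling back to another strategy. A small repro of just that behavior:

```python
import gzip
import io

buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode='wb')
print(gz.seekable())   # True, even though only forward seeks are supported
gz.write(b'some data')
try:
    gz.seek(0)         # a backward seek, like the one ZipFile issues on close
except OSError as e:
    print(e)           # Negative seek in write mode
```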
Has anyone done something similar, i.e. compressed files in streaming mode in Python?
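For reference, one direction I have verified locally: zipfile can write to a stream that honestly reports seekable() as False (it falls back to data descriptors instead of seeking back). The stand-in write-only stream below is my own mock; whether this carries over to smart_open's S3 writer when the output path is a .zip instead of a .gz is part of what I'm asking:

```python
import io
import zipfile

class WriteOnlyStream(io.RawIOBase):
    """Stand-in for an upload stream: writable, but not seekable."""
    def __init__(self):
        self._buf = io.BytesIO()
    def writable(self):
        return True
    def seekable(self):
        return False
    def write(self, b):
        return self._buf.write(b)
    def getvalue(self):
        return self._buf.getvalue()

out = WriteOnlyStream()
with zipfile.ZipFile(out, 'w', zipfile.ZIP_DEFLATED) as zipper:
    zipper.writestr('folder1/file1.parquet', b'placeholder bytes')
    zipper.writestr('text_file1.txt', b'placeholder bytes')

# The entry names survive, even though the output stream never seeked back
with zipfile.ZipFile(io.BytesIO(out.getvalue())) as zipper:
    print(zipper.namelist())  # ['folder1/file1.parquet', 'text_file1.txt']
```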