I am trying to compress a folder (prefix) of files in S3 that totals more than 200 GB. I found smart_open to be a very promising package, but after reading its "How To" wiki I still need more guidance on how to use it.
I have a "folder" of files in S3, for example:
--S3_Bucket_1/file_folder
  -- folder1/file1.parquet
  -- folder2/file2.parquet
  -- text_file1.txt
  -- jar_file1.jar
  ...
I want to compress all the files under file_folder (>200 GB) in a streaming fashion and save the compressed result to another S3 location, i.e. s3://bucket2/123/output.gz.
I found that smart_open provides functionality very close to what I need. With the code below, I was able to compress all the files under file_folder into one single file named output.gz:
import boto3
import smart_open
from smart_open import s3

source_bucket_name = "S3_Bucket_1"          # the source bucket containing the files I want to compress
prefix = "file_folder"                      # S3 prefix for the files under a "folder"
output_path = "s3://bucket2/123/output.gz"  # where I want to save the compressed output

# smart_open gzip-compresses the stream because output_path ends in ".gz"
with smart_open.open(output_path, 'wb') as fout:
    for key, content in s3.iter_bucket(source_bucket_name, prefix=prefix):
        fout.write(content)
However, this does not do what I want: all the files get merged into one single file inside the compressed output. I want to keep the original file structure.
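To make the problem concrete: gzip is a pure stream compressor with no notion of entries, so concatenated contents come back as one undifferentiated blob. A minimal local demonstration (the byte contents are just placeholders):

```python
import gzip
import io

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    # Analogous to writing each S3 object's content in turn
    gz.write(b'contents of file1')
    gz.write(b'contents of file2')

# Decompressing yields a single concatenated stream; the file
# boundaries and names are gone.
print(gzip.decompress(buf.getvalue()))  # b'contents of file1contents of file2'
```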
I also tried compressing the files with Python's ZipFile:
import tempfile
import zipfile

import smart_open
from smart_open import s3

source_bucket_name = "S3_Bucket_1"          # the source bucket containing the files I want to compress
prefix = "file_folder"                      # S3 prefix for the files under a "folder"
output_path = "s3://bucket2/123/output.gz"  # where I want to save the compressed output

with tempfile.NamedTemporaryFile() as tmp:
    tp = {'writebuffer': tmp}
    with smart_open.open(output_path, 'wb', transport_params=tp) as fout:
        with zipfile.ZipFile(fout, 'w') as zipper:
            for key, content in s3.iter_bucket(source_bucket_name, prefix=prefix):
                zipper.writestr(key, content)
This gives me the error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-2-e61f4a7d1257> in <module>
19 with zipfile.ZipFile(fout, 'w') as zipper:
20 for key, content in s3.iter_bucket(source_bucket_name, prefix = prefix):
---> 21 zipper.writestr(key, content)
~/anaconda3/lib/python3.8/zipfile.py in writestr(self, zinfo_or_arcname, data, compress_type, compresslevel)
1815 with self._lock:
1816 with self.open(zinfo, mode='w') as dest:
-> 1817 dest.write(data)
1818
1819 def __del__(self):
~/anaconda3/lib/python3.8/zipfile.py in close(self)
1178 # Preserve current position in file
1179 self._zipfile.start_dir = self._fileobj.tell()
-> 1180 self._fileobj.seek(self._zinfo.header_offset)
1181 self._fileobj.write(self._zinfo.FileHeader(self._zip64))
1182 self._fileobj.seek(self._zipfile.start_dir)
~/anaconda3/lib/python3.8/gzip.py in seek(self, offset, whence)
374 raise ValueError('Seek from end not supported')
375 if offset < self.offset:
--> 376 raise OSError('Negative seek in write mode')
377 count = offset - self.offset
378 chunk = b'\0' * 1024
OSError: Negative seek in write mode
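As far as I can tell, the seek comes from ZipFile rewriting each entry's local header after the data is written. A gzip stream opened for writing reports seekable() as True (it only supports forward seeks), so ZipFile attempts the backward seek instead of falling back to another strategy. A small repro of just that behavior:

```python
import gzip
import io

buf = io.BytesIO()
gz = gzip.GzipFile(fileobj=buf, mode='wb')
print(gz.seekable())   # True, even though only forward seeks are supported
gz.write(b'some data')
try:
    gz.seek(0)         # a backward seek, like the one ZipFile issues on close
except OSError as e:
    print(e)           # Negative seek in write mode
```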
Has anyone done something similar, i.e. compressed files in streaming mode in Python?
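For reference, one direction I have verified locally: zipfile can write to a stream that honestly reports seekable() as False (it falls back to data descriptors instead of seeking back). The stand-in write-only stream below is my own mock; whether this carries over to smart_open's S3 writer when the output path is a .zip instead of a .gz is part of what I'm asking:

```python
import io
import zipfile

class WriteOnlyStream(io.RawIOBase):
    """Stand-in for an upload stream: writable, but not seekable."""
    def __init__(self):
        self._buf = io.BytesIO()
    def writable(self):
        return True
    def seekable(self):
        return False
    def write(self, b):
        return self._buf.write(b)
    def getvalue(self):
        return self._buf.getvalue()

out = WriteOnlyStream()
with zipfile.ZipFile(out, 'w', zipfile.ZIP_DEFLATED) as zipper:
    zipper.writestr('folder1/file1.parquet', b'placeholder bytes')
    zipper.writestr('text_file1.txt', b'placeholder bytes')

# The entry names survive, even though the output stream never seeked back
with zipfile.ZipFile(io.BytesIO(out.getvalue())) as zipper:
    print(zipper.namelist())  # ['folder1/file1.parquet', 'text_file1.txt']
```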