Gzip decompression from one S3 bucket to another

Time: 2020-04-01 08:26:18

Tags: python-3.x amazon-web-services amazon-s3 multiprocessing boto3

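I have a bunch of gz files in a folder on S3. I am trying to decompress the files in parallel and dump the decompressed contents into another S3 bucket.

I am trying to use multiprocessing in Python, and this is what I have tried.

I have the following worker function: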

import gzip
from io import BytesIO
import boto3

def worker_function(objnm,source_bucket,source_folder_location,target_bucket,target_folder_location) :
    filename = objnm.split('/')[-1].replace("GZ.001","txt")
    print("target_file :---" + filename)
    target=target_folder_location + filename
    print("file copying to destination :-- " + target)
    # `client` and `key` are not parameters: they are globals from the __main__ block, visible here only when processes are forked.
    boto3.client('s3').upload_fileobj(Fileobj=gzip.GzipFile(None, 'rb', fileobj=BytesIO(client.get_object(Bucket=source_bucket, Key=key)['Body'].read())), Bucket=target_bucket, Key=target)

This function decompresses a gz file and writes it to the target S3 bucket. Then I have the following if block:

if __name__ == "__main__":

    client=boto3.client('s3') 

    source_bucket=sys.argv[1]
    source_folder_location=sys.argv[2]
    target_bucket=sys.argv[3]
    target_folder_location=sys.argv[4]

    objlist = client.list_objects(Bucket=source_bucket,Prefix=source_folder_location)['Contents']
    for object_summary in objlist:
        key = object_summary["Key"]
        print(key)

        if (".GZ" in key):
            p = multiprocessing.Process(target=worker_function, args=(key, source_bucket, source_folder_location, target_bucket, target_folder_location))
            p.start()

However, when I run this code, I get the following error:

Process Process-16:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File ".py", line 31, in worker_function
    boto3.client('s3').upload_fileobj(Fileobj=gzip.GzipFile(None, 'rb', fileobj=BytesIO(client.get_object(Bucket=source_bucket, Key=key)['Body'].read())), Bucket=target_bucket, Key=target)
  File "/usr/local/lib/python3.6/site-packages/botocore/response.py", line 78, in read
    chunk = self._raw_stream.read(amt)
  File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 503, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/lib64/python3.6/http/client.py", line 472, in read
    s = self._safe_read(self.length)
  File "/usr/lib64/python3.6/http/client.py", line 627, in _safe_read
    return b"".join(s)
MemoryError

I am trying to understand what this error means and how to fix it. I am running this code on a huge node with 95 cores. Can someone help me? Thanks.
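One plausible reading of the traceback: the `MemoryError` is raised inside `['Body'].read()`, which pulls an entire compressed object into memory, and the loop starts one process per matching key with no upper bound, so many such reads can be in flight at once. The following is only a minimal sketch of one way to limit both, not a tested fix. It assumes the node has enough local disk to hold one decompressed file per worker. The names `target_key_for`, `copy_decompressed`, `run`, `max_workers`, and the optional `s3` parameter are hypothetical and do not appear in the original code.

```python
import gzip
import shutil
import tempfile
from multiprocessing import Pool


def target_key_for(source_key, target_prefix):
    """Map '.../NAME.GZ.001' to '<target_prefix>NAME.txt', as in the question."""
    filename = source_key.split('/')[-1].replace("GZ.001", "txt")
    return target_prefix + filename


def copy_decompressed(job, s3=None):
    """Decompress one S3 object through a temporary file and upload the result.

    `job` is a (source_bucket, source_key, target_bucket, target_key) tuple.
    `s3` is an assumption added for illustration: it lets a caller inject a
    client; by default each worker process creates its own boto3 client.
    """
    source_bucket, source_key, target_bucket, target_key = job
    if s3 is None:
        import boto3  # imported lazily, inside the worker process
        s3 = boto3.client('s3')
    body = s3.get_object(Bucket=source_bucket, Key=source_key)['Body']
    with tempfile.TemporaryFile() as tmp:
        # Read the response body in chunks instead of calling .read() on it,
        # so neither the compressed nor the decompressed data sits in RAM whole.
        with gzip.GzipFile(fileobj=body, mode='rb') as plain:
            shutil.copyfileobj(plain, tmp)
        tmp.seek(0)  # rewind so the upload starts at the first byte
        s3.upload_fileobj(Fileobj=tmp, Bucket=target_bucket, Key=target_key)
    return target_key


def run(jobs, max_workers=8):
    """Process all jobs with a fixed-size pool rather than one process per file."""
    with Pool(processes=max_workers) as pool:
        return list(pool.imap_unordered(copy_decompressed, jobs))
```

With the listing loop from the question, `jobs` would hold one tuple `(source_bucket, key, target_bucket, target_key_for(key, target_folder_location))` per key containing ".GZ", and `run(jobs)` would be called once after the loop. The value of `max_workers` is a guess and would need tuning against the file sizes and the memory and disk actually available.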

0 answers:

No answers