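I have a bunch of gz files in a folder on S3. I am trying to decompress the files in parallel and dump the decompressed contents into another S3 bucket.
I am trying to use multiprocessing in Python, and this is what I have tried.
I have the following worker function: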
def worker_function(objnm, source_bucket, source_folder_location, target_bucket, target_folder_location):
    filename = objnm.split('/')[-1].replace("GZ.001", "txt")
    print("target_file :---" + filename)
    target = target_folder_location + filename
    print("file copying to destination :-- " + target)
    boto3.client('s3').upload_fileobj(Fileobj=gzip.GzipFile(None, 'rb', fileobj=BytesIO(client.get_object(Bucket=source_bucket, Key=key)['Body'].read())), Bucket=target_bucket, Key=target)
This function decompresses the gz file and writes it to the target S3 bucket. Then I have the following if block:
if __name__ == "__main__":
    client = boto3.client('s3')
    source_bucket = sys.argv[1]
    source_folder_location = sys.argv[2]
    target_bucket = sys.argv[3]
    target_folder_location = sys.argv[4]
    objlist = client.list_objects(Bucket=source_bucket, Prefix=source_folder_location)['Contents']
    for object_summary in objlist:
        key = object_summary["Key"]
        print(key)
        if (".GZ" in key):
            p = multiprocessing.Process(target=worker_function, args=(key, source_bucket, source_folder_location, target_bucket, target_folder_location))
            p.start()
However, when I run this code, I get the following error:
Process Process-16:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib64/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File ".py", line 31, in worker_function
    boto3.client('s3').upload_fileobj(Fileobj=gzip.GzipFile(None, 'rb', fileobj=BytesIO(client.get_object(Bucket=source_bucket, Key=key)['Body'].read())), Bucket=target_bucket, Key=target)
  File "/usr/local/lib/python3.6/site-packages/botocore/response.py", line 78, in read
    chunk = self._raw_stream.read(amt)
  File "/usr/local/lib/python3.6/site-packages/urllib3/response.py", line 503, in read
    data = self._fp.read() if not fp_closed else b""
  File "/usr/lib64/python3.6/http/client.py", line 472, in read
    s = self._safe_read(self.length)
  File "/usr/lib64/python3.6/http/client.py", line 627, in _safe_read
    return b"".join(s)
MemoryError
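I am trying to understand what this error means and how to resolve it. I am running this code on a huge node with 95 cores. Can anyone help me? Thanks.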
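For reference, the traceback ends inside `['Body'].read()`, which pulls a whole compressed object into memory, and the loop starts one process per matching key with no upper bound. Below is a minimal sketch of one possible direction, not a verified fix: decompress in fixed-size chunks through a temporary file and cap the number of workers with `multiprocessing.Pool`. The names `gunzip_stream`, `target_key_for`, `run`, `CHUNK_SIZE`, and the default of 8 processes are hypothetical choices made for illustration. It also assumes that the streaming `Body` can be read sequentially by `gzip.GzipFile`, that the node has enough local disk for the decompressed files, and that it runs on Linux (the temporary file is reopened by name while still open). It has not been run against S3.

```python
import gzip
import shutil
import tempfile
from multiprocessing import Pool

CHUNK_SIZE = 1024 * 1024  # 1 MiB per copy step; an arbitrary illustrative value


def target_key_for(source_key, target_prefix):
    """Apply the naming rule from the question: '.GZ.001' suffix becomes '.txt'."""
    filename = source_key.split('/')[-1].replace("GZ.001", "txt")
    return target_prefix + filename


def gunzip_stream(compressed, out, chunk_size=CHUNK_SIZE):
    """Decompress a readable binary stream into `out`, chunk_size bytes at a time."""
    with gzip.GzipFile(fileobj=compressed, mode='rb') as gz:
        shutil.copyfileobj(gz, out, chunk_size)


def worker_function(job):
    """Decompress one S3 object via a temporary file rather than an in-memory buffer."""
    import boto3  # imported here so each worker process creates its own client
    source_bucket, key, target_bucket, target_prefix = job
    client = boto3.client('s3')
    # Assumption: the streaming Body supports the sequential read() calls gzip makes.
    body = client.get_object(Bucket=source_bucket, Key=key)['Body']
    with tempfile.NamedTemporaryFile() as tmp:
        gunzip_stream(body, tmp)
        tmp.flush()  # make sure all bytes are on disk before uploading by name
        client.upload_file(tmp.name, target_bucket, target_key_for(key, target_prefix))
    return key


def run(source_bucket, source_prefix, target_bucket, target_prefix, processes=8):
    """List matching keys, then process them with a bounded pool of workers."""
    import boto3
    client = boto3.client('s3')
    jobs = []
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
        for obj in page.get('Contents', []):
            if ".GZ" in obj['Key']:
                jobs.append((source_bucket, obj['Key'], target_bucket, target_prefix))
    # At most `processes` objects are in flight at any time.
    with Pool(processes=processes) as pool:
        for done in pool.imap_unordered(worker_function, jobs):
            print("finished " + done)
```

With this layout, `run(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])` would go under the `if __name__ == "__main__":` guard in place of the loop above. The pool size trades throughput against memory and disk use, so a reasonable value depends on the file sizes involved.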