AWS Glue Python Shell job fails with MemoryError

Date: 2020-04-29 15:45:13

Tags: python-3.x amazon-web-services pyspark aws-glue pyarrow

I have an AWS Glue Python Shell job that fails after running for about a minute while processing a 2 GB text file. The job performs minor edits to the file, such as finding and removing certain lines, removing the last character on a line, and adding carriage returns based on conditions. The same job runs just fine for file sizes below 1 GB.

  • The job's "Maximum capacity" setting is 1.
  • "Max concurrency" is 2880.
  • "Job timeout (minutes)" is 900.

Detailed failure message:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-9g022ft7/pysh-tf-bb-to-parquet.py", line 134, in <module>
MemoryError

The actual Python code I am trying to run:

import boto3
import json
import os
import sys
from sys import getsizeof
from datetime import datetime
import psutil
import io 
import pandas as pd 
import pyarrow as pa #not supported by glue
import pyarrow.parquet as pq #not supported by glue
import s3fs #not supported by glue

#Object parameters (input and output).
s3region = 'reducted' 
s3bucket_nm = 'reducted' 

#s3 inbound object parameters.
s3object_inbound_key_only = 'reducted' 
s3object_inbound_folder_only = 'reducted' 
s3object_inbound_key = s3object_inbound_folder_only + '/' + s3object_inbound_key_only 

#s3 object base folder parameter.
s3object_base_folder = s3object_inbound_key_only[:-9].replace('.', '_')

#s3 raw object parameters.
s3object_raw_key_only = s3object_inbound_key_only
s3object_raw_folder_only = 'reducted' + s3object_base_folder
s3object_raw_key = s3object_raw_folder_only + '/' + s3object_inbound_key_only

#s3 PSV object parameters.
s3object_psv_key_only = s3object_inbound_key_only + '.psv'
s3object_psv_folder_only = 'reducted' + s3object_base_folder + '_psv'
s3object_psv_key = s3object_psv_folder_only + '/' + s3object_psv_key_only
s3object_psv_crawler = s3object_base_folder + '_psv'

glue_role = 'reducted'

processed_immut_db = 'reducted'

#Instantiate s3 client.
s3client = boto3.client(
    's3',
    region_name = s3region
)

#Instantiate s3 resource.
s3resource = boto3.resource(
    's3',
    region_name = s3region
)

#Store raw object metadata as a dictionary variable.
s3object_raw_dict = {
    'Bucket': s3bucket_nm,
    'Key': s3object_inbound_key
}

#Create raw file object.
s3object_i = s3client.get_object(
    Bucket = s3bucket_nm,
    Key = s3object_raw_folder_only + '/' + s3object_raw_key_only
)

#Initialize the list to hold the raw file data string.
l_data = []

#Load s_data string into a list and transform.
for line in (''.join((s3object_i['Body'].read()).decode('utf-8'))).splitlines():
    #Once the line with the beginning of the field list tag is reached, re-initialize the list.
    if line.startswith('START-OF-FIELDS'):
        l_data = []
    #Load (append) the input file into the list.
    l_data.append(line + '\n')
    #Once the line with the end of the field list tag is reached, remove the field metadata tags.
    if line.startswith('END-OF-FIELDS'):
    #Remove the blank lines.
        l_data=[line for line in l_data if '\n' != line]
        #Remove lines with #.
        l_data=[line for line in l_data if '#' not in line]
        #Remove the tags signifying the start and end of the field list.
        l_data.remove('START-OF-FIELDS\n')
        l_data.remove('END-OF-FIELDS\n')
        #Remove the new line characters (\n) from each field name (assuming the last character in each element).
        l_data=list(map(lambda i: i[:-1], l_data))
        #Insert "missing" field names in the beginning of the header.
        l_data.insert(0, 'BB_FILE_DT')
        l_data.insert(1, 'BB_ID')
        l_data.insert(2, 'RETURN_CD')
        l_data.insert(3, 'NO_OF_FIELDS')
        #Add | delimiter to each field.
        l_data=[each + "|" for each in l_data]
        #Concatenate all header elements into a single element.
        l_data = [''.join(l_data[:])]
    #Once the line with the end of data dataset tag is reached, remove the dataset metadata tags.
    if line.startswith('END-OF-FILE'):
        #Remove TIMESTARTED metadata.
        l_data=[line for line in l_data if 'TIMESTARTED' not in line]
        #Remove lines with #.
        l_data=[line for line in l_data if '#' not in line]
        #Remove the tags signifying the start and end of the dataset.
        l_data.remove('START-OF-DATA\n')
        l_data.remove('END-OF-DATA\n')
        #Remove DATARECORDS metadata.
        l_data=[line for line in l_data if 'DATARECORDS' not in line]
        #Remove TIMEFINISHED metadata.
        l_data=[line for line in l_data if 'TIMEFINISHED' not in line]
        #Remove END-OF-FILE metadata.
        l_data=[line for line in l_data if 'END-OF-FILE' not in line]

#Store the file header into a variable.
l_data_header=l_data[0][:-1] + '\n'

#Add the column with the name of the inbound file to all elements of the file body.
l_data_body=[s3object_inbound_key_only[-8:] + '|' + line[:-2] + '\n' for line in l_data[2:]]

#Combine the file header and file body into a single list.
l_data_body.insert(0, l_data_header)

#Load the transformed list into a string variable.
s3object_o_data = ''.join(l_data_body)

#Write the transformed list from a string variable to a new s3 object.
s3resource.Object(s3bucket_nm, s3object_psv_folder_only + '/' + s3object_psv_key_only).put(Body=s3object_o_data)

I have determined that the "MemoryError" is caused by the line of code below. s3object_i_data_decoded contains the 2 GB file I mentioned earlier. Before this line executes, the total memory taken up by the Python process is 2.025 GB. After the following line runs, memory usage appears to spike sharply:

#Load the transformed list into a string variable.
s3object_o_data = ''.join(l_data_body)

After measuring the process's memory size while the code ran, I found that whenever the list variable is loaded into another variable, the amount of memory used almost quadruples. So assigning the 2 GB list variable to another variable causes the process's memory footprint to grow to 6+ GB. :/
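For reference, this is roughly how I measured the process memory around that line (a minimal sketch using the psutil and os modules already imported in my script; l_data_body is the list built above):

process = psutil.Process(os.getpid())

#Resident memory (RSS, in GB) right before the join.
print('Before join: %.3f GB' % (process.memory_info().rss / 1024 ** 3))

#The line that triggers the spike.
s3object_o_data = ''.join(l_data_body)

#Resident memory right after the join.
print('After join: %.3f GB' % (process.memory_info().rss / 1024 ** 3))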

I also suspect that Glue Python Shell jobs have trouble handling files beyond the 2 GB size range... can anyone confirm this?

  1. Has anyone else run into this error when processing files larger than 2 GB?
  2. Are there any adjustments that can be made to the job to avoid this "MemoryError"?
  3. Is a 2 GB dataset simply too large for a Glue Python Shell job, and should I perhaps be looking at Glue Spark instead?

In theory, I could split the job into smaller batches within the code itself, but I wanted to see if there is any lower-hanging fruit first (a rough sketch of what I mean by batching is below).
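For illustration only, a rough sketch of the batching idea, assuming ranged GETs via the boto3 Range parameter (the chunk size is an arbitrary placeholder, and the carry-over handling for lines that straddle a chunk boundary is simplified):

#Hypothetical batching sketch: read the object in fixed-size byte ranges instead of all at once.
chunk_size = 256 * 1024 * 1024  #256 MB per request (placeholder value).
object_size = s3client.head_object(Bucket=s3bucket_nm, Key=s3object_inbound_key)['ContentLength']

leftover = ''
for start in range(0, object_size, chunk_size):
    end = min(start + chunk_size, object_size) - 1
    body = s3client.get_object(
        Bucket=s3bucket_nm,
        Key=s3object_inbound_key,
        Range='bytes=%d-%d' % (start, end)
    )['Body'].read().decode('utf-8')
    lines = (leftover + body).splitlines(True)
    #A chunk usually ends mid-line; carry the partial line over to the next chunk.
    leftover = lines.pop() if lines and not lines[-1].endswith('\n') else ''
    for line in lines:
        #The per-line edits from the job above would go here.
        pass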

If it isn't necessary, I would really like to just tweak the existing job and avoid moving to Glue Spark.

Thanks in advance to everyone for sharing your thoughts! :)

1 Answer:

Answer 0 (Score: 1)

It would be great if you could show a code snippet. 1 DPU gives you 4 cores and 16 GB of memory, which is more than enough to process your data.

The best option is to read the file as a StreamingBody and then perform your operations on it in chunks. You can refer to it here.

Basically, it is best to take advantage of S3's streaming capability.

If you share how you are reading and writing the 2 GB file, handling it here shouldn't be a big deal.

I have several suggestions, which you can implement if you like:

  1. While processing the file, don't read the whole file into memory; read it line by line instead:

for line in s3object_i['Body'].iter_lines():
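Expanded into a minimal sketch (iter_lines on the botocore StreamingBody yields undecoded byte lines; the per-line transformation is left as a placeholder for your existing logic):

#Stream the object line by line instead of calling .read() on the whole body.
s3object_i = s3client.get_object(
    Bucket = s3bucket_nm,
    Key = s3object_raw_folder_only + '/' + s3object_raw_key_only
)

l_data = []
for raw_line in s3object_i['Body'].iter_lines():
    line = raw_line.decode('utf-8')
    #Your existing START-OF-FIELDS / END-OF-FIELDS / END-OF-FILE handling goes here,
    #operating on one decoded line at a time.
    l_data.append(line + '\n')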
  2. You are using list comprehensions again and again just to filter the data, which makes repeated passes over the list and increases the time complexity of your code. You can combine the filters into a single compound statement and optimize it like this:
    if line.startswith('END-OF-FIELDS'):
        #Remove the tags first, before the "|" delimiter is appended to each element.
        l_data.remove('START-OF-FIELDS')
        l_data.remove('END-OF-FIELDS')
        l_data.insert(0, 'BB_FILE_DT')
        l_data.insert(1, 'BB_ID')
        l_data.insert(2, 'RETURN_CD')
        l_data.insert(3, 'NO_OF_FIELDS')
        l_data=[line + "|" for line in l_data if ('' != line) and ('#' not in line)]
        l_data = [''.join(l_data[:])]

#and
    if line.startswith('END-OF-FILE'):
        l_data.remove('START-OF-DATA')
        l_data.remove('END-OF-DATA')
        l_data=[line for line in l_data if ('TIMESTARTED' not in line) and ('#' not in line) and ('DATARECORDS' not in line) and ('TIMEFINISHED' not in line) and ('END-OF-FILE' not in line)]
  3. To save the file back to S3, you can either take advantage of multipart upload, or create a generator object instead of a list and yield the results to S3. For example:
from boto3.s3.transfer import TransferConfig

def uploadFileS3():
    #Upload in 25 MB chunks to s3 (thresholds are specified in bytes).
    config = TransferConfig(multipart_threshold=1024 * 1024 * 25, max_concurrency=10,
                            multipart_chunksize=1024 * 1024 * 25, use_threads=True)

    #file, S3_BUCKET, key and ProgressPercentage (the callback class from the boto3
    #docs example) are placeholders to be filled in with your own values.
    s3_client.upload_file(file, S3_BUCKET, key,
                          Config=config,
                          Callback=ProgressPercentage(file))


------------------------------------------------------------
#or a bit tricky to implement but worth it
------------------------------------------------------------
def file_stream():
    for line in l_data:
        yield line

# we have to keep track of all of our parts
part_info_dict = {'Parts': []}
# start the multipart_upload process
multi_part_upload = s3.create_multipart_upload(Bucket=bucket_name, Key=temp_key)

# Part Indexes are required to start at 1
for part_index, line in enumerate(file_stream(), start=1):
    # store the return value from s3.upload_part for later
    part = s3.upload_part(
        Bucket=bucket_name,
        Key=temp_key,
        # PartNumber's need to be in order and unique
        PartNumber=part_index,
        # This 'UploadId' is part of the dict returned in multi_part_upload
        UploadId=multi_part_upload['UploadId'],
        # The chunk of the file we're streaming.
        Body=line,
    )

    # PartNumber and ETag are needed
    part_info_dict['Parts'].append({
        'PartNumber': part_index,
        # You can get this from the return of the uploaded part that we stored earlier
        'ETag': part['ETag']
    })

# This is what AWS needs to finish the multipart upload process
completed_ctx = {
    'Bucket': bucket_name,
    'Key': temp_key,
    'UploadId': multi_part_upload['UploadId'],
    'MultipartUpload': part_info_dict
}

# Complete the upload. This triggers Amazon S3 to rebuild the file for you.
# No need to manually stitch all of the parts together ourselves!
s3.complete_multipart_upload(**completed_ctx)
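One caveat with the snippet above: S3 requires every part of a multipart upload except the last to be at least 5 MB, so uploading one line per part will be rejected for anything but tiny files. A small sketch of buffering the generator output into large-enough parts (5 MB is the documented S3 minimum; the buffer size is otherwise arbitrary):

def buffered_parts(lines, min_part_size=5 * 1024 * 1024):
    #Accumulate lines until the buffer reaches at least min_part_size, then yield it as one part.
    buffer = []
    buffered_bytes = 0
    for line in lines:
        buffer.append(line)
        buffered_bytes += len(line.encode('utf-8'))
        if buffered_bytes >= min_part_size:
            yield ''.join(buffer)
            buffer, buffered_bytes = [], 0
    #The final part is allowed to be smaller than 5 MB.
    if buffer:
        yield ''.join(buffer)

#Then iterate over buffered_parts(file_stream()) instead of file_stream()
#in the upload_part loop above.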

If you can implement these changes, you can process even 5 GB files in the Glue Python shell. The key is to optimize the code better.

Hope you get the idea.

Thanks.