I'm working with AWS Lambda and I'm writing a function that should parse a multi-line CSV and write it back out as single-line CSV rows. The Lambda script is triggered when a file is dropped into S3. I have already written the script, and it works correctly on test files (a few KB).
import boto3
import botocore
import csv
import os
import sys
import datetime
import uuid
field_number = 29
s3_client = boto3.client('s3')
def lambda_handler(event, context):
    start = datetime.datetime.now()
    for record in event['Records']:
        # get the event time
        event_time = record['eventTime']
        # get the event name (e.g. ObjectCreated:Put, ...)
        event_name = record['eventName']
        # get the principal_id (i.e. the user who performed the action)
        principal_id = record['userIdentity']['principalId']
        # get the name of the bucket on which the event is performed
        bucket_name = record['s3']['bucket']['name']
        # get the name of the object affected by the action
        object_name = record['s3']['object']['key']
        destination_path = 'test_multiline/' + object_name.split("/")[len(object_name.split("/")) - 1]
        # get the file from S3
        try:
            response = s3_client.get_object(Bucket=bucket_name, Key=object_name)
            print('object correctly read from S3')
        except:
            print('Error in reading file from S3')
        file_content = response['Body'].read().decode('utf-8')
        file_content = file_content.replace('""', '')
        file_content = file_content.replace(',\n', ',""\n')
        while ',,' in file_content:
            file_content = file_content.replace(',,', ',"",')
        # get the elements of the file separated by comma
        file_content_csv = csv.reader(file_content, delimiter=",")
        list = []
        csv_line = ""
        index = 0
        row_num = 0
        for element in file_content_csv:
            # if this condition is met, it means a new row has just started
            if len(element) == 0:
                csv_line = ""
                index = 0
            else:
                # if this condition is met, it means that this is an element of
                # the csv (not a comma)
                if len(element) == 1:
                    # check if this is the last element of the row
                    if index == field_number - 1:
                        csv_line = csv_line + "" + str(element[0].replace(',', ''))
                        csv_line = csv_line.replace('\n', ' ')
                        list.append(csv_line)
                        row_num = row_num + 1
                    else:
                        csv_line = csv_line + "" + str(element[0].replace(',', '')) + ","
                        csv_line = csv_line.replace('\n', ' ')
                        index = index + 1
        try:
            with open("/tmp/local_output.csv", "w+") as outfile:
                for entries in list:
                    outfile.write(entries)
                    outfile.write("\n")
            print('/tmp/local_output.csv correctly written to local')
            outfile.close()
        except IOError:
            print('Error in writing file in local')
        # upload the new file to S3
        try:
            s3_client.upload_file('/tmp/local_output.csv', 'multiline', destination_path)
            print('test_multiline/s3_output.csv correctly written to S3')
        except:
            print('Error in writing file to S3')
    # get the time the lambda function stops
    stop = datetime.datetime.now()
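For reference, the handler reads its inputs from the standard S3 event notification payload; a trimmed record looks roughly like this (all values below are illustrative, not taken from my environment):

# Illustrative S3 PUT event record; the handler reads eventTime, eventName,
# the principal id, the bucket name and the object key from it.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventTime": "2019-01-07T12:00:00.000Z",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {"principalId": "AWS:EXAMPLE"},
            "s3": {
                "bucket": {"name": "my-input-bucket"},   # placeholder name
                "object": {"key": "input/some_file.csv", "size": 838860800}
            }
        }
    ]
}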
As mentioned, the script handles files of a few KB correctly. However, my production files are around 800 MB, and when I upload one of them to S3 I get this error:
REPORT RequestId: e8c6103f-1287-11e9-a1cf-8fcf787319ca Duration: 9117.51 ms Billed Duration: 9200 ms Memory Size: 3008 MB Max Memory Used: 3008 MB
As you can see, I have already raised the memory to 3008 MB and the execution time limit to 900 s (the maximum).
I then tried splitting the 800 MB file into eight 100 MB chunks. When I upload the eight files to S3, the first file is processed fine, but from the second file onwards I get the same problem highlighted above.
Can you help me solve this? I thought that splitting the file into smaller chunks would fix the problem.
Answer 0 (score: 0):
You need to use something like SNS or SQS, or even Step Functions, so that you trigger one Lambda per file instead of trying to process everything in a single invocation.
Lambda is not designed to handle large, long-running transactions; it is designed for small pieces of code that perform a transaction on a small amount of data.
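As a rough illustration of the SQS route (a sketch, not the asker's code; the queue wiring, function names and the use of a streaming read are my assumptions): the bucket's notification is delivered to an SQS queue instead of invoking the Lambda directly, and a worker Lambda receives one message, i.e. one file, per invocation.

import json
import boto3

s3_client = boto3.client('s3')

def worker_handler(event, context):
    # With an SQS trigger, each SQS record body carries the original S3 event.
    for sqs_record in event['Records']:
        s3_event = json.loads(sqs_record['body'])
        for s3_record in s3_event['Records']:
            bucket_name = s3_record['s3']['bucket']['name']
            object_name = s3_record['s3']['object']['key']
            process_single_file(bucket_name, object_name)

def process_single_file(bucket_name, object_name):
    # Placeholder for the CSV normalisation from the question; iterating the
    # StreamingBody line by line keeps memory flat instead of read()-ing the
    # whole object into a string.
    response = s3_client.get_object(Bucket=bucket_name, Key=object_name)
    for line in response['Body'].iter_lines():
        pass  # parse / rewrite one line at a time here

With a batch size of 1 on the SQS trigger, each invocation handles exactly one file, and failed messages can be retried or routed to a dead-letter queue instead of failing the whole run.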