I'm working with AWS Lambda and I'm writing a function that should parse a multi-line CSV and write it back out as single-line CSV rows. The Lambda script is triggered when a file is dropped into S3. I have already written the script, and it works correctly on test files (a few KB).
import boto3
import botocore
import csv
import os
import sys
import datetime
import uuid
field_number = 29
s3_client = boto3.client('s3')
def lambda_handler(event, context):
    start = datetime.datetime.now()
    for record in event['Records']:
        # get the event time
        event_time = record['eventTime']
        # get the event name (e.g. ObjectCreated:Put, ...)
        event_name = record['eventName']
        # get the principal_id (i.e. the user who performed the action)
        principal_id = record['userIdentity']['principalId']
        # get the name of the bucket on which the event is performed
        bucket_name = record['s3']['bucket']['name']
        # get the name of the object affected by the action
        object_name = record['s3']['object']['key']
        destination_path = 'test_multiline/' + object_name.split("/")[len(object_name.split("/")) - 1]
        # get the file from S3
        try:
            response = s3_client.get_object(Bucket=bucket_name, Key=object_name)
            print('object correctly read from S3')
        except:
            print('Error in reading file from S3')
        file_content = response['Body'].read().decode('utf-8')
        file_content = file_content.replace('""', '')
        file_content = file_content.replace(',\n', ',""\n')
        while ',,' in file_content:
            file_content = file_content.replace(',,', ',"",')
        # get the elements of the file separated by comma
        file_content_csv = csv.reader(file_content, delimiter=",")
        list = []
        csv_line = ""
        index = 0
        row_num = 0
        for element in file_content_csv:
            # if this condition is met, it means a new row has just started
            if len(element) == 0:
                csv_line = ""
                index = 0
            else:
                # if this condition is met, it means that this is an element of
                # the csv (not a comma)
                if len(element) == 1:
                    # check if this is the last element of the row
                    if index == field_number - 1:
                        csv_line = csv_line + "" + str(element[0].replace(',', ''))
                        csv_line = csv_line.replace('\n', ' ')
                        list.append(csv_line)
                        row_num = row_num + 1
                    else:
                        csv_line = csv_line + "" + str(element[0].replace(',', '')) + ","
                        csv_line = csv_line.replace('\n', ' ')
                        index = index + 1
        try:
            with open("/tmp/local_output.csv", "w+") as outfile:
                for entries in list:
                    outfile.write(entries)
                    outfile.write("\n")
            print('/tmp/local_output.csv correctly written to local')
            outfile.close()
        except IOError:
            print('Error in writing file in local')
        # upload the new file to S3
        try:
            s3_client.upload_file('/tmp/local_output.csv', 'multiline', destination_path)
            print('test_multiline/s3_output.csv correctly written to S3')
        except:
            print('Error in writing file to S3')
    # get the time the lambda function stops
    stop = datetime.datetime.now()
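For reference, the handler reads its inputs from the standard S3 event notification payload; a trimmed record looks roughly like this (all values below are illustrative, not taken from my environment):

# Illustrative S3 PUT event record; the handler reads eventTime, eventName,
# the principal id, the bucket name and the object key from it.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventTime": "2019-01-07T12:00:00.000Z",
            "eventName": "ObjectCreated:Put",
            "userIdentity": {"principalId": "AWS:EXAMPLE"},
            "s3": {
                "bucket": {"name": "my-input-bucket"},   # placeholder name
                "object": {"key": "input/some_file.csv", "size": 838860800}
            }
        }
    ]
}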
As mentioned, the script handles files of a few KB correctly. However, my production files are around 800 MB, and when I upload one of them to S3 I get this error:
REPORT RequestId: e8c6103f-1287-11e9-a1cf-8fcf787319ca Duration: 9117.51 ms Billed Duration: 9200 ms Memory Size: 3008 MB Max Memory Used: 3008 MB
As you can see, I have already raised the memory to 3008 MB and the execution time limit to 900 s (the maximum).
I then tried splitting the 800 MB file into eight 100 MB chunks. When I upload the eight files to S3, the first file is processed fine, but from the second file onwards I get the same problem highlighted above.
Can you help me solve this? I thought that splitting the file into smaller chunks would fix the problem.
Answer 0 (score: 0):
You need to use something like SNS or SQS, or even Step Functions, so that you trigger one Lambda per file instead of trying to process everything in a single invocation.
Lambda is not designed to handle large, long-running transactions; it is designed for small pieces of code that perform a transaction on a small amount of data.
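As a rough illustration of the SQS route (a sketch, not the asker's code; the queue wiring, function names and the use of a streaming read are my assumptions): the bucket's notification is delivered to an SQS queue instead of invoking the Lambda directly, and a worker Lambda receives one message, i.e. one file, per invocation.

import json
import boto3

s3_client = boto3.client('s3')

def worker_handler(event, context):
    # With an SQS trigger, each SQS record body carries the original S3 event.
    for sqs_record in event['Records']:
        s3_event = json.loads(sqs_record['body'])
        for s3_record in s3_event['Records']:
            bucket_name = s3_record['s3']['bucket']['name']
            object_name = s3_record['s3']['object']['key']
            process_single_file(bucket_name, object_name)

def process_single_file(bucket_name, object_name):
    # Placeholder for the CSV normalisation from the question; iterating the
    # StreamingBody line by line keeps memory flat instead of read()-ing the
    # whole object into a string.
    response = s3_client.get_object(Bucket=bucket_name, Key=object_name)
    for line in response['Body'].iter_lines():
        pass  # parse / rewrite one line at a time here

With a batch size of 1 on the SQS trigger, each invocation handles exactly one file, and failed messages can be retried or routed to a dead-letter queue instead of failing the whole run.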