Reading multiple files from an S3 bucket and processing them with a Lambda trigger

Asked: 2019-07-04 16:32:48

Tags: amazon-s3 aws-lambda boto3 amazon-rds

I am reading multiple files from S3, processing them, and then using the processed data frames to create tables in AWS RDS. I am doing all of this locally with PyCharm on macOS.

I want to read these CSV files from the S3 bucket and run the same Python script to process them in AWS rather than on my local machine. I want to trigger the script with Lambda, and it should run only once all of the required files have been uploaded to the bucket.

How would the code need to change for AWS Lambda?

My current code is as follows:

import boto3
import pandas as pd
import numpy as np
import sys

client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('test-s3')



#CREATE ALL THE NEEDED OBJECTS
obj1 = client.get_object(Bucket='test-s3', Key='file1.csv')
obj2 = client.get_object(Bucket='test-s3', Key='file2.csv')
obj3 = client.get_object(Bucket='test-s3', Key='file3.csv')
obj4 = client.get_object(Bucket='test-s3', Key='file4.csv')
obj5 = client.get_object(Bucket='test-s3', Key='file5.csv')
obj6 = client.get_object(Bucket='test-s3', Key='file6.csv')
obj7 = client.get_object(Bucket='test-s3', Key='file7.csv')
obj8 = client.get_object(Bucket='test-s3', Key='file8.csv')
obj9 = client.get_object(Bucket='test-s3', Key='file9.csv')
obj10 = client.get_object(Bucket='test-s3', Key='file10.csv')
obj11 = client.get_object(Bucket='test-s3', Key='file11.csv')
obj12 = client.get_object(Bucket='test-s3', Key='file12.csv')
obj13 = client.get_object(Bucket='test-s3', Key='file13.csv')
obj14 = client.get_object(Bucket='test-s3', Key='file14.csv')
obj15 = client.get_object(Bucket='test-s3', Key='file15.csv')


#CREATE ALL THE DATAFRAMES FROM RESPECTIVE OBJECTS
df_file1 = pd.read_csv(obj1['Body'], encoding='utf-8', sep = ',')
df_file2 = pd.read_csv(obj2['Body'], encoding='utf-8', sep = ',')
df_file3 = pd.read_csv(obj3['Body'], encoding='utf-8', sep = ',')
df_file4 = pd.read_csv(obj4['Body'], encoding='utf-8', sep = ',')
df_file5 = pd.read_csv(obj5['Body'], encoding='utf-8', sep = ',')
df_file6 = pd.read_csv(obj6['Body'], encoding='utf-8', sep = ',')
df_file7 = pd.read_csv(obj7['Body'], encoding='utf-8', sep = ',')
df_file8 = pd.read_csv(obj8['Body'], encoding='utf-8', sep = ',')
df_file9 = pd.read_csv(obj9['Body'], encoding='utf-8', sep = ',')
df_file10 = pd.read_csv(obj10['Body'], encoding='utf-8', sep = ',')
df_file11 = pd.read_csv(obj11['Body'], encoding='utf-8', sep = ',')
df_file12 = pd.read_csv(obj12['Body'], encoding='utf-8', sep = ',')
df_file13 = pd.read_csv(obj13['Body'], encoding='utf-8', sep = ',')
df_file14 = pd.read_csv(obj14['Body'], encoding='utf-8', sep = ',')
df_file15 = pd.read_csv(obj15['Body'], encoding='utf-8', sep = ',')


#+++++++++++ make a function to process the data frames ++++++++++++


def function(df_file1, df_file2):
    # *** some logic ***
    return df_final



## MAKE THE TABLES IN RDS

from sqlalchemy import create_engine
import psycopg2
engine = create_engine('postgresql://USERNAME:PASSWORD@***.eu-central-1.rds.amazonaws.com:5432/DBNAME')
df_final.to_sql('table name', engine, schema='data')

I am a complete beginner with AWS Lambda. How do I run this script on Lambda?

Following Ninad's suggestion, I edited the script. It is now as follows:

import boto3
import pandas as pd
import numpy as np
import sys

client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('test-s3')

def function(df_file1, df_file2):
    # *** some logic ***
    return df_final



def lambda_handler(event, context):
    obj1 = client.get_object(Bucket='test-s3', Key='file1.csv')
    obj2 = client.get_object(Bucket='test-s3', Key='file2.csv')
    obj3 = client.get_object(Bucket='test-s3', Key='file3.csv')


    df_file1 = pd.read_csv(obj1['Body'], encoding='utf-8', sep=',')
    df_file2 = pd.read_csv(obj2['Body'], encoding='utf-8', sep=',')
    df_file3 = pd.read_csv(obj3['Body'], encoding='utf-8', sep=',')


    df_final = function(df_file1, df_file2)

    from sqlalchemy import create_engine
    import psycopg2
    engine = create_engine('postgresql://USERNAME:PASSWORD@***.eu-central-1.rds.amazonaws.com:5432/DBNAME')
    df_final.to_sql('table name', engine, schema='data')

I created a virtual environment on my local machine and installed all the packages (pandas, SQLAlchemy, and so on). I zipped the packages together with the script and uploaded the archive to Lambda. Now I am getting this error:

[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'pandas'

I packaged everything necessary by following the aws package deploy link. Why am I still getting the error?

1 answer:

Answer 0: (score: 0)

Create the lambda using the console. Choose the Python version you need, make sure enough memory is allocated, and set the timeout to 15 minutes (the maximum). Creating the lambda also lets you attach a role to it. Create a role and attach a policy that grants access to the S3 bucket where your CSVs live.

The next step is to create a layer for your lambda that contains all the dependencies needed to run the script. Lambda has the boto3 package installed by default, but you will need to install pandas (and all of its dependencies), sqlalchemy, and psycopg2. You can find a simple tutorial on how to do this here.

Now that you have created the layer, attach it to your lambda.

Finally, we can move on to your script. Since you need to read all the CSV files under an S3 path, you have to change the script to read them dynamically; currently the CSV file names are hardcoded. You can change the script to first fetch all the keys in the bucket with something like:

response = client.list_objects_v2(
    Bucket='test-s3'  # the bucket name string, not the boto3 Bucket resource object
)['Contents']
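One caveat worth noting (not in the original answer): list_objects_v2 returns at most 1,000 keys per call. If the bucket could ever hold more, a paginator is safer. A minimal sketch, with the client passed in as a parameter so the helper can be exercised without AWS credentials; the names are illustrative:

```python
def list_all_keys(client, bucket):
    """Collect every key in the bucket, following pagination past 1,000 objects."""
    keys = []
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket):
        # 'Contents' is absent for an empty result page, so default to an empty list
        keys.extend(obj['Key'] for obj in page.get('Contents', []))
    return keys
```

Inside the Lambda this would be called as list_all_keys(client, 'test-s3') in place of the single list_objects_v2 call.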

This will give you a list of keys. Filter them if needed.
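Since the question asks that processing run only once all of the needed files have been uploaded, one possible filter is to compare the listed keys against the expected set and do nothing while any are still missing. This is a sketch; the fifteen file names are assumed from the question's hardcoded keys:

```python
# The fifteen expected uploads from the question; adjust to the real key names.
REQUIRED_KEYS = {'file{}.csv'.format(i) for i in range(1, 16)}

def missing_keys(found_keys, required=REQUIRED_KEYS):
    """Return, sorted, the required keys not yet present in the bucket listing."""
    return sorted(required - set(found_keys))
```

In the handler, an early "if missing_keys(keys): return" would make the Lambda a no-op on each upload event until the final file arrives.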

Next, you can create multiple data frames by looping over the response like this:

d = {}
for idx, obj in enumerate(response):
    # str(idx) is needed here: concatenating an int to 'df_' raises a TypeError
    d['df_' + str(idx)] = pd.read_csv(
        client.get_object(Bucket='test-s3', Key=obj['Key'])['Body'],
        encoding='utf-8', sep=',')

This creates a dictionary d containing all the data frames. Please try this code locally first to iron out any errors.

Now copy your final code and paste it into the lambda editor above def lambda_handler():, and call your processing function from the lambda handler function.
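Putting the answer's pieces together, the handler might look roughly like the sketch below. This is an assumption-laden outline, not the answer's exact code: function() stands for the question's placeholder processing logic, the bucket name and connection-string placeholders are copied from the question, and the table_name_for naming rule is invented for illustration:

```python
import os

def table_name_for(key):
    # Hypothetical naming rule: 'file1.csv' -> table 'file1'
    return os.path.splitext(os.path.basename(key))[0]

def lambda_handler(event, context):
    # Imported inside the handler only so table_name_for can be tried locally
    # without boto3/pandas/sqlalchemy installed; top-level imports work too.
    import boto3
    import pandas as pd
    from sqlalchemy import create_engine

    client = boto3.client('s3')
    response = client.list_objects_v2(Bucket='test-s3')['Contents']

    # Build one data frame per key, indexed by key name
    frames = {}
    for obj in response:
        frames[obj['Key']] = pd.read_csv(
            client.get_object(Bucket='test-s3', Key=obj['Key'])['Body'],
            encoding='utf-8', sep=',')

    # function() is the question's processing function, defined above the handler
    df_final = function(frames['file1.csv'], frames['file2.csv'])

    engine = create_engine(
        'postgresql://USERNAME:PASSWORD@***.eu-central-1.rds.amazonaws.com:5432/DBNAME')
    df_final.to_sql(table_name_for('file1.csv'), engine, schema='data')
```

The lazy imports also mean a missing layer fails at invocation time with a clear traceback rather than at module load.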