I am reading several files from S3, processing them, and then using the processed data frames to create tables in AWS RDS. I am doing all of this with PyCharm on macOS.
I want to read these csv files from the S3 bucket and run the same Python script to process them in AWS rather than on my local system. I want to trigger the script with Lambda, and it should run only once all the required files have been uploaded to the bucket.
How will the code change in AWS Lambda?
My current code is below -
import boto3
import pandas as pd
import numpy as np
import sys
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('test-s3')
#CREATE ALL THE NEEDED OBJECTS
obj1 = client.get_object(Bucket='test-s3', Key='file1.csv')
obj2 = client.get_object(Bucket='test-s3', Key='file2.csv')
obj3 = client.get_object(Bucket='test-s3', Key='file3.csv')
obj4 = client.get_object(Bucket='test-s3', Key='file4.csv')
obj5 = client.get_object(Bucket='test-s3', Key='file5.csv')
obj6 = client.get_object(Bucket='test-s3', Key='file6.csv')
obj7 = client.get_object(Bucket='test-s3', Key='file7.csv')
obj8 = client.get_object(Bucket='test-s3', Key='file8.csv')
obj9 = client.get_object(Bucket='test-s3', Key='file9.csv')
obj10 = client.get_object(Bucket='test-s3', Key='file10.csv')
obj11 = client.get_object(Bucket='test-s3', Key='file11.csv')
obj12 = client.get_object(Bucket='test-s3', Key='file12.csv')
obj13 = client.get_object(Bucket='test-s3', Key='file13.csv')
obj14 = client.get_object(Bucket='test-s3', Key='file14.csv')
obj15 = client.get_object(Bucket='test-s3', Key='file15.csv')
#CREATE ALL THE DATAFRAMES FROM RESPECTIVE OBJECTS
df_file1 = pd.read_csv(obj1['Body'], encoding='utf-8', sep = ',')
df_file2 = pd.read_csv(obj2['Body'], encoding='utf-8', sep = ',')
df_file3 = pd.read_csv(obj3['Body'], encoding='utf-8', sep = ',')
df_file4 = pd.read_csv(obj4['Body'], encoding='utf-8', sep = ',')
df_file5 = pd.read_csv(obj5['Body'], encoding='utf-8', sep = ',')
df_file6 = pd.read_csv(obj6['Body'], encoding='utf-8', sep = ',')
df_file7 = pd.read_csv(obj7['Body'], encoding='utf-8', sep = ',')
df_file8 = pd.read_csv(obj8['Body'], encoding='utf-8', sep = ',')
df_file9 = pd.read_csv(obj9['Body'], encoding='utf-8', sep = ',')
df_file10 = pd.read_csv(obj10['Body'], encoding='utf-8', sep = ',')
df_file11 = pd.read_csv(obj11['Body'], encoding='utf-8', sep = ',')
df_file12 = pd.read_csv(obj12['Body'], encoding='utf-8', sep = ',')
df_file13 = pd.read_csv(obj13['Body'], encoding='utf-8', sep = ',')
df_file14 = pd.read_csv(obj14['Body'], encoding='utf-8', sep = ',')
df_file15 = pd.read_csv(obj15['Body'], encoding='utf-8', sep = ',')
#+++++++++++ make a function to process the data frames ++++++++++++
def function(df_file1, df_file2):
    # *** some logic ***
    return df_final
## MAKE THE TABLES IN RDS
from sqlalchemy import create_engine
import psycopg2
engine = create_engine('postgresql://USERNAME:PASSWORD@***.eu-central-1.rds.amazonaws.com:5432/DBNAME')
df_final.to_sql('table name', engine, schema='data')
I am a complete beginner with AWS Lambda. How do I run this script on Lambda?
After taking Ninad's advice, I edited the script. It now looks like this -
import boto3
import pandas as pd
import numpy as np
import sys
client = boto3.client('s3')
resource = boto3.resource('s3')
my_bucket = resource.Bucket('test-s3')
def function(df_file1, df_file2):
    # *** some logic ***
    return df_final
def lambda_handler(event, context):
    obj1 = client.get_object(Bucket='test-s3', Key='file1.csv')
    obj2 = client.get_object(Bucket='test-s3', Key='file2.csv')
    obj3 = client.get_object(Bucket='test-s3', Key='file3.csv')
    df_file1 = pd.read_csv(obj1['Body'], encoding='utf-8', sep=',')
    df_file2 = pd.read_csv(obj2['Body'], encoding='utf-8', sep=',')
    df_file3 = pd.read_csv(obj3['Body'], encoding='utf-8', sep=',')
    df_final = function(df_file1, df_file2)
    from sqlalchemy import create_engine
    import psycopg2
    engine = create_engine('postgresql://USERNAME:PASSWORD@***.eu-central-1.rds.amazonaws.com:5432/DBNAME')
    df_final.to_sql('table name', engine, schema='data')
I created a virtual environment on my local system and installed all the packages - pandas, SQLAlchemy, and so on. I zipped the packages together with the script and uploaded the zip to Lambda. Now I am getting this error -
[ERROR] Runtime.ImportModuleError: Unable to import module 'lambda_function': No module named 'pandas'
I packaged everything needed by following the aws package deploy link. Why do I still get this error?
Answer 0 (score: 0)
Create the lambda using the console. Pick the correct Python version you need, make sure you allocate enough memory, and set the timeout to 15 minutes (the maximum). Creating the lambda also lets you attach a role to it. Create a role, and attach a policy to that role which gives it access to the s3 bucket where your CSVs live.
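If you would rather script those console steps, a minimal boto3 sketch might look like the following; the function name, role ARN, and zip path are placeholders, not values from the question:

import boto3

lambda_client = boto3.client('lambda')
with open('lambda_function.zip', 'rb') as f:
    lambda_client.create_function(
        FunctionName='process-csvs',          # hypothetical name
        Runtime='python3.8',                  # match the version your code needs
        Role='arn:aws:iam::123456789012:role/lambda-s3-role',  # role with s3 access
        Handler='lambda_function.lambda_handler',
        Code={'ZipFile': f.read()},
        Timeout=900,                          # 15 minutes, the maximum
        MemorySize=1024,                      # give pandas room to work
    )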
The next step is to create a layer for your lambda that has all the dependencies your script needs. Lambda has the boto3 package installed by default, but you will need to add pandas (and all of its dependencies), sqlalchemy, and psycopg2. You can find a simple tutorial on how to do this here
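As a rough sketch of how you might build such a layer zip locally (assuming pip is on your PATH, and substituting the pip-installable psycopg2-binary wheel for psycopg2); note that compiled packages like pandas must be built for Amazon Linux, which is the usual cause of the "No module named 'pandas'" error above:

import shutil
import subprocess

# Lambda layers expect Python packages under a top-level python/ directory.
subprocess.run(['pip', 'install', 'pandas', 'sqlalchemy', 'psycopg2-binary',
                '-t', 'layer/python'], check=True)
shutil.make_archive('pandas_layer', 'zip', 'layer')   # produces pandas_layer.zip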
Now that you have created the layer, attach it to your lambda.
We can finally move on to your script. Since you need to read every csv file under an s3 path, you have to change the script to read the csv files dynamically. Currently you have hardcoded the names of the csv files. You can change the script to first fetch all the keys in the bucket with something like this:
response = client.list_objects_v2(
    Bucket='test-s3'   # pass the bucket name as a string, not the Bucket resource
)['Contents']
This gives you a list of keys. Filter them if you need to, as sketched below.
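For example, a short sketch that keeps only the .csv keys, using a paginator since list_objects_v2 returns at most 1000 keys per call:

# Collect every .csv key in the bucket, page by page.
paginator = client.get_paginator('list_objects_v2')
csv_keys = [obj['Key']
            for page in paginator.paginate(Bucket='test-s3')
            for obj in page.get('Contents', [])
            if obj['Key'].endswith('.csv')]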
Next, you can create multiple data frames by looping over the response like this:
d = {}
for idx, obj in enumerate(response):
    d['df_' + str(idx)] = pd.read_csv(            # cast idx to str to build the key
        client.get_object(Bucket='test-s3', Key=obj['Key'])['Body'],
        encoding='utf-8', sep=',')
This creates a dictionary d containing all of your data frames. Please try this code locally first to iron out any errors.
Now copy your final code and paste it into the lambda editor above def lambda_handler():, and call your functions from the lambda handler.
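Putting it all together, a sketch of what the finished handler might look like. The EXPECTED set is one hypothetical way to satisfy the asker's requirement that the job run only once every file has been uploaded: an s3 trigger fires once per uploaded object, so the handler has to check for itself whether all the files have arrived yet.

import boto3
import pandas as pd
from sqlalchemy import create_engine

client = boto3.client('s3')
BUCKET = 'test-s3'
EXPECTED = {'file%d.csv' % i for i in range(1, 16)}   # the 15 files you wait for

def function(df_file1, df_file2):
    # *** some logic ***
    return df_final

def lambda_handler(event, context):
    # See which keys are already in the bucket.
    keys = {obj['Key'] for obj in
            client.list_objects_v2(Bucket=BUCKET).get('Contents', [])}
    if not EXPECTED <= keys:
        return 'still waiting for files'   # exit until every upload has landed

    # Read each expected csv into a data frame, keyed by file name.
    dfs = {}
    for key in EXPECTED:
        body = client.get_object(Bucket=BUCKET, Key=key)['Body']
        dfs[key] = pd.read_csv(body, encoding='utf-8', sep=',')

    df_final = function(dfs['file1.csv'], dfs['file2.csv'])
    engine = create_engine(
        'postgresql://USERNAME:PASSWORD@***.eu-central-1.rds.amazonaws.com:5432/DBNAME')
    df_final.to_sql('table name', engine, schema='data')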