Question

I use AWS Data Pipeline to run nightly SQL queries that populate tables for summary statistics. The UI is a bit funky, but eventually I got it up and working.

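Now I would like to do something similar with a Python script. I have a file (forecast_rev.py) that I run on my laptop every morning, but of course that means I have to turn on my laptop and kick it off every day. Surely I can schedule a pipeline to do the same thing, and then go away on vacation without worrying about it.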

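For the life of me, I cannot find a tutorial, an AWS doc, or a StackOverflow question about this! I am not even sure how to get started. Does anyone have a simple pipeline whose steps they would be willing to share?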

Answer 1

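1. Store your Python script in an S3 bucket.
2. Create a shell script that installs Python and all your dependencies, copies your Python script from S3 to local storage, and runs it (see the shell script example below).
3. Store this shell script in S3 as well.
4. Use a ShellCommandActivity to launch your shell script.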

You can use this template as an example: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-template-redshiftrdsfull.html It uses a Python script stored in S3 to convert a MySQL schema to a Redshift schema.

Example shell script that runs a Python program:

#!/bin/bash
curl -O https://s3.amazonaws.com/datapipeline-us-east-1/sample-scripts/mysql_to_redshift.py
python mysql_to_redshift.py

Answer 2

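I faced a similar situation; here is how I got past it. I will describe how I did it with an Ec2Resource. If you are looking for a solution with an EMRCluster, refer to @franklinsijo's answer.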

Steps

1. Store your Python script in S3.
2. Create a shell script (hello.sh, given below) and store it in S3.
3. Create an Ec2Resource node and a ShellCommandActivity node, and provide this information:

Provide the S3 URL of the shell script in "Script Uri", and set "stage" to true in the ShellCommandActivity. The activity should run on your DefaultResource.
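The two nodes described above could be written out as a pipeline definition roughly like this. This is a minimal sketch, not a complete definition: the ids `MyEc2Resource` and `RunPythonScript` and the S3 path are hypothetical placeholders, and a real pipeline would also need its default settings (roles, schedule, log location), which are not covered here.

```python
import json

# Hypothetical placeholder: replace with the S3 URL of your own hello.sh.
SCRIPT_URI = "s3://path/to/shell_script/hello.sh"


def build_pipeline_definition(script_uri):
    """Return a dict holding the Ec2Resource and ShellCommandActivity objects."""
    ec2_resource = {
        "id": "MyEc2Resource",  # hypothetical id
        "name": "MyEc2Resource",
        "type": "Ec2Resource",
    }
    shell_activity = {
        "id": "RunPythonScript",  # hypothetical id
        "name": "RunPythonScript",
        "type": "ShellCommandActivity",
        "scriptUri": script_uri,  # the "Script Uri" field in the console
        "stage": "true",  # definition values are written as strings
        "runsOn": {"ref": ec2_resource["id"]},
    }
    return {"objects": [ec2_resource, shell_activity]}


definition = build_pipeline_definition(SCRIPT_URI)
print(json.dumps(definition, indent=2))
```

The printed JSON has the same shape as the definition you can export from the console, so it can serve as a starting point for comparing against what the UI generates.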

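Here is the shell script (hello.sh). It downloads your Python program from S3 and stores it locally, installs Python and the required third-party libraries, and finally executes your Python file.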

hello.sh

echo 'Download python file to local temp'
aws s3 cp s3://path/to/python_file/hello_world.py /tmp/hello.py
# Install pip for Python (on CentOS)
sudo yum -y install python-pip
pip install <dependencies>
python /tmp/hello.py

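I ran into trouble when I tried to use a bang line (shebang), so do not include one here. If the aws s3 cp command does not work (older awscli), here is a quick workaround for that case.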

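Follow steps 1-3 above, and also create an S3DataNode.

I. Provide the S3 URL of your Python script in the "File Path" of the S3DataNode.
II. Provide the DataNode as the "input" of the ShellCommandActivity.
III. In the ShellCommandActivity, use the following command.

Command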

echo 'Install Python2'
sudo yum -y install python-pip
pip install <dependencies>
python ${INPUT1_STAGING_DIR}/hello_world.py
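
For the staging variant, the extra S3DataNode and its link to the activity might look like the sketch below. The ids `PythonScriptNode`, `MyEc2Resource`, and `RunPythonScript` are hypothetical, the S3 path is the placeholder used in hello.sh, and `<dependencies>` is left unfilled as in the command above. As before, this shows only the objects discussed here, not a full pipeline definition.

```python
import json

# Placeholder path, as in hello.sh; replace with your own script location.
PYTHON_FILE_URI = "s3://path/to/python_file/hello_world.py"

# The same command lines as above; <dependencies> is still a placeholder.
COMMAND = "\n".join([
    "echo 'Install Python2'",
    "sudo yum -y install python-pip",
    "pip install <dependencies>",
    "python ${INPUT1_STAGING_DIR}/hello_world.py",
])


def build_staged_definition(python_file_uri, command):
    """Return a dict with an S3DataNode wired in as the activity's input."""
    data_node = {
        "id": "PythonScriptNode",  # hypothetical id
        "name": "PythonScriptNode",
        "type": "S3DataNode",
        "filePath": python_file_uri,  # the "File Path" field
    }
    ec2_resource = {
        "id": "MyEc2Resource",  # hypothetical id
        "name": "MyEc2Resource",
        "type": "Ec2Resource",
    }
    shell_activity = {
        "id": "RunPythonScript",  # hypothetical id
        "name": "RunPythonScript",
        "type": "ShellCommandActivity",
        "command": command,
        "stage": "true",  # staging makes the input file available locally
        "input": {"ref": data_node["id"]},
        "runsOn": {"ref": ec2_resource["id"]},
    }
    return {"objects": [data_node, ec2_resource, shell_activity]}


staged = build_staged_definition(PYTHON_FILE_URI, COMMAND)
print(json.dumps(staged, indent=2))
```

Because the file is staged from the input node, the command does not need the aws CLI at all, which is the point of this workaround.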

Running a Python script through AWS Data Pipeline
