I have a PySpark script that runs in AWS Glue, but every time I have to create the job from the UI and copy the code into it. Is there any way to create the job automatically from a file in an S3 bucket? (I already have all the libraries and the Glue context that will be used at runtime.)
Answer 0 (score: 3)
Another alternative is to use AWS CloudFormation. You can define all the AWS resources you want to create (not only Glue jobs) in a template file and then update the stack from the AWS Console or via the CLI whenever needed.
The template for a Glue job looks like this:
MyJob:
  Type: AWS::Glue::Job
  Properties:
    Command:
      Name: glueetl
      ScriptLocation: "s3://aws-glue-scripts//your-script-file.py"
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-enable"
    ExecutionProperty:
      MaxConcurrentRuns: 2
    MaxRetries: 0
    Name: cf-job1
    Role: !Ref MyJobRole # reference to a Role resource which is not presented here
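Once the template is saved to a file, creating or updating the stack can also be scripted from Python with boto3 instead of going through the console or CLI. A minimal sketch, assuming the template is stored locally as glue-job-template.yml (a placeholder name):

import boto3

# Read the CloudFormation template from a local file (placeholder name).
with open("glue-job-template.yml") as f:
    template_body = f.read()

cloudformation = boto3.client("cloudformation")

# Create the stack; use update_stack for subsequent changes.
# CAPABILITY_NAMED_IAM is only needed if the template also creates IAM
# resources such as the MyJobRole referenced above.
cloudformation.create_stack(
    StackName="my-glue-jobs",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],
)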
Answer 1 (score: 0)
Yes, it is possible. For example, you can use boto3, the AWS SDK for Python, for this purpose:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-calling.html
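A minimal sketch of this approach, assuming the script has already been uploaded to S3 (the job name, bucket, script path, and role below are placeholders):

import boto3

glue = boto3.client("glue")

# Create a Glue job that points at a PySpark script stored in S3.
glue.create_job(
    Name="my-glue-job",        # placeholder job name
    Role="MyGlueServiceRole",  # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my-script.py",  # placeholder path
        "PythonVersion": "3",
    },
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
    MaxRetries=0,
)

# Optionally kick off a run of the newly created job.
glue.start_job_run(JobName="my-glue-job")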
Answer 2 (score: 0)
You can achieve this by writing a shell script; I wrote one for this purpose.
Answer 3 (score: 0)
I created an open-source library called datajob to deploy and orchestrate Glue jobs. You can find it on GitHub at https://github.com/vincentclaes/datajob and on PyPI:
pip install datajob
npm install -g aws-cdk@1.87.1
You create a file datajob_stack.py that describes your Glue jobs and how they are orchestrated:
from datajob.datajob_stack import DataJobStack
from datajob.glue.glue_job import GlueJob
from datajob.stepfunctions.stepfunctions_workflow import StepfunctionsWorkflow
with DataJobStack(stack_name="data-pipeline-simple") as datajob_stack:

    # here we define 3 glue jobs with a relative path to the source code.
    task1 = GlueJob(
        datajob_stack=datajob_stack,
        name="task1",
        job_path="data_pipeline_simple/task1.py",
    )
    task2 = GlueJob(
        datajob_stack=datajob_stack,
        name="task2",
        job_path="data_pipeline_simple/task2.py",
    )
    task3 = GlueJob(
        datajob_stack=datajob_stack,
        name="task3",
        job_path="data_pipeline_simple/task3.py",
    )

    # we instantiate a step functions workflow and add the sources
    # we want to orchestrate.
    with StepfunctionsWorkflow(
        datajob_stack=datajob_stack, name="data-pipeline-simple"
    ) as sfn:
        [task1, task2] >> task3
The last line, [task1, task2] >> task3, means that task1 and task2 run in parallel and task3 runs after both of them have finished. To deploy your code to AWS Glue, execute:
export AWS_PROFILE=my-profile
datajob deploy --config datajob_stack.py
Any feedback is greatly appreciated!