Question

I'm running a very simple job, as follows:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
l_table = glueContext.create_dynamic_frame.from_catalog(
    database="gluecatalog",
    table_name="fctable")
# Drop the unneeded fields and rename tbl_code to table_code
l_table = l_table.drop_fields(['seq', 'partition_0', 'partition_1', 'partition_2', 'partition_3']).rename_field('tbl_code', 'table_code')
print("Count: %d" % l_table.count())
l_table.printSchema()
l_table.select_fields(['trans_time']).toDF().distinct().show()
# Flatten the nested 'table' array into separate frames, staged under the S3 temp path
dfc = l_table.relationalize("table_root", "s3://my-bucket/temp/")
print("Before keys() call")
dfc.keys()
print("After keys() call")
l_table.select_fields(['table']).printSchema()
dfc.select('table_root_table').toDF().where("id = 1 or id = 2").orderBy(['id', 'index']).show()
dfc.select('table_root').toDF().where("table = 1 or table = 2").show()

The data structure is simple as well:

root
|-- table: array
| |-- element: struct
| | |-- trans_time: string
| | |-- seq: null
| | |-- operation: string
| | |-- order_date: string
| | |-- order_code: string
| | |-- tbl_code: string
| | |-- ship_plant_code: string
|-- partition_0
|-- partition_1
|-- partition_2
|-- partition_3

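When I run the job as a test, it takes 12 to 16 minutes to finish. But the CloudWatch logs show that the job needed only about 2 seconds to display all of my data.

So my question is: where can I see what the AWS Glue job spends its time on outside the period covered by the logs, and is it doing anything outside that logged period?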

Answer 1

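The time is spent setting up the environment that allows the code to run. I had the same problem and contacted the AWS Glue team, and they were helpful. The reason it takes so long is that Glue builds an environment when you run the first job, and that environment stays alive for 1 hour. If you run the same script, or any other script, within that hour, the next job takes significantly less time. They call the first run a cold start. My first job took 17 minutes; I ran the same job again right after the first one finished, and it took only 3 minutes.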
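One rough way to see how much of a run went to this kind of setup is to compare the run's wall-clock duration with the execution time Glue reports for it. The sketch below assumes the `StartedOn`, `CompletedOn`, and `ExecutionTime` fields returned by boto3's `get_job_runs`, and it assumes that `ExecutionTime` does not include environment provisioning, which is worth checking against your own runs. The helper names and the example record are hypothetical, with numbers chosen only for illustration.

```python
from datetime import datetime


def startup_overhead_seconds(job_run):
    """Wall-clock duration of a run minus the execution time Glue reports.

    Assumption: ExecutionTime (seconds) excludes environment setup, so the
    difference approximates cold-start overhead.
    """
    wall = (job_run["CompletedOn"] - job_run["StartedOn"]).total_seconds()
    return wall - job_run["ExecutionTime"]


def fetch_latest_run(job_name):
    """Return the most recently started run of a job, or None (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    glue = boto3.client("glue")
    runs = glue.get_job_runs(JobName=job_name, MaxResults=20)["JobRuns"]
    return max(runs, key=lambda r: r["StartedOn"]) if runs else None


# Hypothetical record shaped like one entry of get_job_runs()["JobRuns"]:
# 17 minutes of wall-clock time, 180 seconds reported as execution time.
example_run = {
    "StartedOn": datetime(2019, 5, 1, 12, 0, 0),
    "CompletedOn": datetime(2019, 5, 1, 12, 17, 0),
    "ExecutionTime": 180,
}
print(startup_overhead_seconds(example_run))  # 840.0
```

Against a real job you would call `startup_overhead_seconds(fetch_latest_run("my-job"))`, where `"my-job"` is a placeholder name. Running it for a cold run and again for a run started shortly afterwards should show whether the difference shrinks.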

Answer 2

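When you edit the job (the "Edit job" action), you can add more DPUs in the "Script libraries and job parameters (optional)" section. It helps, but in my experience you shouldn't expect any major improvement.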
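If you would rather not change the job definition in the console, the capacity can also be set per run through the API. This is a minimal sketch, assuming boto3's `start_job_run` with its `MaxCapacity` parameter (the number of DPUs); the helper names, the job name, and the DPU value are placeholders, not part of the original job.

```python
def build_run_request(job_name, dpus):
    """Build keyword arguments for glue.start_job_run with a per-run DPU override."""
    if dpus <= 0:
        raise ValueError("dpus must be positive")
    # MaxCapacity is expressed in DPUs and is passed as a float.
    return {"JobName": job_name, "MaxCapacity": float(dpus)}


def start_run(job_name, dpus):
    """Start the job with the requested capacity (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_run_request works without boto3

    glue = boto3.client("glue")
    return glue.start_job_run(**build_run_request(job_name, dpus))["JobRunId"]


# Placeholder job name and capacity, for illustration only.
request = build_run_request("my-glue-job", 20)
print(request)
```

More DPUs only affect the part of the run that executes your script, so this is unlikely to shorten the environment setup time discussed in the other answers.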

Answer 3

Update as of May 2019:

Cold start time = 7-8 minutes
Warm pool kept alive = 10-15 minutes

AWS Glue takes a long time to finish
