Why do AWS Glue jobs sometimes seem to use fewer executors?

Time: 2019-05-10 09:08:22

Tags: amazon-web-services apache-spark pyspark apache-spark-sql aws-glue

I'm trying to understand why my Glue job doesn't seem to run fully in parallel most of the time.

[Image: screenshot of the job's executor activity]

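As the image shows, most executors seem to stop about halfway through the job. My guess is that some of the workload isn't truly parallel. Is there a way to find out which operations cause this, or is this expected behavior?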
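
To make the guess about non-parallel work concrete: a Spark stage runs one task per partition, so a stage with fewer tasks than executors cannot keep every executor busy. The sketch below only illustrates that upper bound. The function name, the executor count, and the stage sizes are hypothetical and are not taken from this job.

```python
def max_busy_executors(num_tasks: int, num_executors: int) -> int:
    """Upper bound on executors that can have work in one stage.

    Each task runs on exactly one executor, so a stage cannot occupy
    more executors than it has tasks.
    """
    if num_tasks < 0 or num_executors < 0:
        raise ValueError("counts must be non-negative")
    return min(num_tasks, num_executors)


# Hypothetical stages: (description, number of tasks).
stages = [
    ("scan of many input partitions", 400),
    ("aggregation collapsed to a few partitions", 3),
    ("final single-partition write", 1),
]

for name, tasks in stages:
    busy = max_busy_executors(tasks, 20)
    print(f"{name}: at most {busy} of 20 executors busy")
```

If the job behaves this way, the drop in active executors would line up with the start of a stage that has only a few partitions.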

When I check CloudWatch Logs, this is all I seem to see:

19/05/10 09:05:15 INFO Client: Application report for application_1557470405923_0001 (state: RUNNING)
19/05/10 09:05:15 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.43.166
ApplicationMaster RPC port: 0
queue: default
start time: 1557471194764
final status: UNDEFINED
tracking URL: http://ip-172-31-42-62.ap-southeast-1.compute.internal:20888/proxy/application_1557470405923_0001/
user: root
19/05/10 09:05:16 INFO Client: Application report for application_1557470405923_0001 (state: RUNNING)
19/05/10 09:05:16 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.43.166
ApplicationMaster RPC port: 0
queue: default
start time: 1557471194764
final status: UNDEFINED
tracking URL: http://ip-172-31-42-62.ap-southeast-1.compute.internal:20888/proxy/application_1557470405923_0001/
user: root
19/05/10 09:05:17 INFO Client: Application report for application_1557470405923_0001 (state: RUNNING)
19/05/10 09:05:17 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 172.31.43.166
ApplicationMaster RPC port: 0
queue: default
start time: 1557471194764
final status: UNDEFINED
tracking URL: http://ip-172-31-42-62.ap-southeast-1.compute.internal:20888/proxy/application_1557470405923_0001/
user: root
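
The lines above are YARN client status reports: they repeat the application state about once per second and say nothing about Spark stages or executors. Below is a minimal sketch of pulling the state out of such lines, assuming exactly the format quoted above. The function name `parse_reports` is hypothetical.

```python
import re

# Matches the "Application report" lines in the format quoted above.
REPORT_RE = re.compile(
    r"^(?P<time>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO Client: "
    r"Application report for (?P<app>application_\d+_\d+) "
    r"\(state: (?P<state>\w+)\)$"
)


def parse_reports(log_text: str) -> list:
    """Return (timestamp, application id, state) for each report line."""
    reports = []
    for line in log_text.splitlines():
        match = REPORT_RE.match(line.strip())
        if match:
            reports.append((match["time"], match["app"], match["state"]))
    return reports


sample = """\
19/05/10 09:05:15 INFO Client: Application report for application_1557470405923_0001 (state: RUNNING)
19/05/10 09:05:16 INFO Client: Application report for application_1557470405923_0001 (state: RUNNING)
"""

for timestamp, app_id, state in parse_reports(sample):
    print(timestamp, app_id, state)
```

Every report here carries the same state, which is why these logs alone cannot show which part of the job is running.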

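It looks like it is still executing my fairly large Spark SQL query, since I see no logs indicating otherwise. Is there any way to see what Glue is doing at each point, for example whether it is working on this query or on something else?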

0 answers:

No answers yet