Why does my AWS Glue job use only one executor and the driver?

Time: 2018-07-24 23:03:58

Tags: amazon-web-services pyspark aws-glue

In my script I convert all of the DynamicFrames to PySpark DataFrames and run groupBy and join operations on them. In the metrics view I then see that only one executor is ever active, no matter how many DPUs I allocate.

After about 2 hours the job failed with:


Diagnostics: Container [pid=8417, containerID=container_1532458272694_0001_01_000001] is running beyond physical memory limits. Current usage: 5.5 GB of 5.5 GB physical memory used; 7.7 GB of 27.5 GB virtual memory used. Killing container.

I have roughly 2 billion rows of data, and the job's DPU count is set to 80.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import count

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read both tables from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "in_json", transformation_ctx = "datasource0")
datasource1 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "out_json", transformation_ctx = "datasource1")


# Keep only the two fields of interest
applymapping0 = ApplyMapping.apply(frame = datasource0, mappings = [("fieldA", "int", "fieldA", "int"), ("fieldB", "string", "fieldB", "string")], transformation_ctx = "applymapping0")
applymapping1 = ApplyMapping.apply(frame = datasource1, mappings = [("fieldA", "int", "fieldA", "int"), ("fieldB", "string", "fieldB", "string")], transformation_ctx = "applymapping1")

# Count rows per key in each dataset
df1 = applymapping0.toDF().groupBy("fieldB").agg(count('*').alias("total_number_1"))
df2 = applymapping1.toDF().groupBy("fieldB").agg(count('*').alias("total_number_2"))

# Join the two per-key counts on the common key
result_joined = df1.join(df2, "fieldB")

# Convert back to a DynamicFrame for the Glue sink
result = DynamicFrame.fromDF(result_joined, glueContext, "result")

# Write the joined result to S3 as JSON
datasink2 = glueContext.write_dynamic_frame.from_options(frame = result, connection_type = "s3", connection_options = {"path": "s3://test-bucket"}, format = "json", transformation_ctx = "datasink2")
job.commit()

Am I missing something?

2 answers:

Answer 0: (score: 1)

Try to repartition the DataFrames. You can repartition based on a column, to an arbitrary number of partitions, or both.

Something like this:

df1 = applymapping0.toDF().groupBy("fieldB").agg(count('*').alias("total_number_1"))
df2 = applymapping1.toDF().groupBy("fieldB").agg(count('*').alias("total_number_2"))

df1_r = df1.repartition(df1["fieldB"])
df2_r = df2.repartition(df2["fieldB"])

df1_r.join(df2_r, "fieldB")
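
If repartitioning by the column alone does not help, you can also combine it with an explicit partition count. A minimal sketch (the 200 below is only an illustrative number, not a tuned value):

# Repartition to a fixed number of partitions only
df1_r = df1.repartition(200)

# Or repartition to a fixed number of partitions AND by the join column
df1_r = df1.repartition(200, "fieldB")
df2_r = df2.repartition(200, "fieldB")

df1_r.join(df2_r, "fieldB")

Repartitioning both sides on the join column before the join keeps rows with the same key together, which can reduce shuffling during the join itself.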

Answer 1: (score: 0)

It turned out that this happened because my input data was so large that the job was stuck at the beginning reading it, with only one executor active. Once the actual computation started, I saw multiple executors become active.

df1.repartition(df1["fieldB"]) actually made it slower; maybe I wasn't using it correctly.
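
As a rough sanity check (just a sketch, not Glue-specific advice), you can log how many partitions the input DataFrame has before the shuffle, to see whether the slow single-executor phase is simply the initial read:

df1 = applymapping0.toDF()
print("input partitions:", df1.rdd.getNumPartitions())

# Assumption: spreading the data over more partitions up front;
# 80 here just mirrors the DPU setting and is not a recommended value
df1 = df1.repartition(80)
print("after repartition:", df1.rdd.getNumPartitions())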