Question

I haven't been able to figure this out, but I am trying to use the direct output committer in AWS Glue:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2

Is it possible to use this configuration with AWS Glue?

Answer 1

Option 1:

Glue uses a Spark context, so you can set Hadoop configuration in AWS Glue as well, because a DynamicFrame is internally a kind of DataFrame.

sc._jsc.hadoopConfiguration().set("mykey","myvalue")

I think you also need to add the corresponding class, like this:

sc._jsc.hadoopConfiguration().set("mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter")

Sample code snippet:

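from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()

# Hadoop configuration values are passed as strings
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.algorithm.version", "2")

glueContext = GlueContext(sc)
spark = glueContext.spark_session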

To verify that the configuration is present:

Debugging in Python:

print(sc._conf.getAll())  # print the Spark configuration

Debugging in Scala:

sc.getConf.getAll.foreach(println)
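One caveat worth checking: a value set directly through hadoopConfiguration().set(...) is written to the Hadoop Configuration object rather than to SparkConf, so it may not appear in the SparkConf listing above. A minimal sketch for reading the value back from where it was written. The helper name committer_algorithm_version is hypothetical, and sc is assumed to be the SparkContext created in the sample snippet:

```python
COMMITTER_KEY = "mapreduce.fileoutputcommitter.algorithm.version"


def committer_algorithm_version(sc):
    """Return the committer algorithm version seen by the Hadoop configuration.

    `sc` is assumed to be a live pyspark SparkContext. The value comes back
    as a string, for example "2".
    """
    # Read from the Hadoop Configuration object, which is the object that
    # sc._jsc.hadoopConfiguration().set(...) modifies.
    return sc._jsc.hadoopConfiguration().get(COMMITTER_KEY)
```

Inside the Glue job, print(committer_algorithm_version(sc)) after the set(...) call would show whether the setting took effect.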

Option 2:

Another approach you could try is Glue job parameters:

https://docs.aws.amazon.com/glue/latest/dg/add-job.html describes key-value job parameters, as shown in the documentation:

'--myKey' : 'value-for-myKey'

You can edit the job and specify the parameter with --conf.

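Option 3:
If you are using the AWS CLI, you can try the following... https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html

The documentation there carries a "do not set" note for it.

Interesting. But I don't know why it is exposed.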
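For the CLI route, here is a minimal sketch of how the argument could be passed for a single run. It assumes a job named my-glue-job (a placeholder) and uses aws glue start-job-run with its --arguments map. Python is used only to build and quote the command reliably. Given the "do not set" note in the documentation, treat this as something to test rather than a supported setting:

```python
import json
import shlex

# Setting from the question, passed as the value of the --conf job argument.
COMMITTER_CONF = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2"


def build_start_job_run_command(job_name):
    """Build the `aws glue start-job-run` command as a list of arguments."""
    arguments = {"--conf": COMMITTER_CONF}
    return [
        "aws", "glue", "start-job-run",
        "--job-name", job_name,
        "--arguments", json.dumps(arguments),
    ]


if __name__ == "__main__":
    # "my-glue-job" is a placeholder job name (assumption).
    command = build_start_job_run_command("my-glue-job")
    # shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
    print(shlex.join(command))
```

Returning a list also means the command can be handed to subprocess.run without going through a shell.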

Summary: I personally prefer option 1 because it gives you programmatic control.

Answer 2

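Go to the Glue jobs console and edit the job as follows:

Glue > Jobs > Edit your job > Script libraries and job parameters (optional) > Job parameters

Set the following: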

Key: --conf
Value:

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
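The same key and value can also be supplied for a single run instead of being saved on the job. A hedged sketch using the start_job_run call of the boto3 Glue client. The function name start_job_with_committer_v2 and the job name in the usage note are made up for illustration, and the client is passed in so the helper does not depend on how credentials are configured:

```python
def start_job_with_committer_v2(glue_client, job_name):
    """Start a Glue job run with the --conf job parameter from this answer.

    `glue_client` is assumed to be a boto3 Glue client, for example the
    object returned by boto3.client("glue").
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={
            # Same key/value pair as entered in the console steps above.
            "--conf": "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2",
        },
    )
    # start_job_run returns the identifier of the new run.
    return response["JobRunId"]
```

In a real script this would be called as start_job_with_committer_v2(boto3.client("glue"), "my-glue-job"), where my-glue-job is a placeholder.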

Using Spark fileoutputcommitter.algorithm.version=2 with AWS Glue
