Question

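My Glue job reads a table (a CSV file on S3), repartitions it, and writes 10 JSON files to S3.

I noticed that in some rows of the output files, certain columns have disappeared!

Here is the code: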

etalab_named_postgre_csv = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "tab", transformation_ctx = "datasource0")
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
path_s3 = "s3://Bucket"
etalab_named_postgre_csv = applymapping_etalab_named_postgre_csv.toDF()
etalab_named_postgre_csv.repartition(10).write.format("json").mode("overwrite").save(path_s3)

In the output files, some columns disappear!

I loaded the same input table with Spark on EMR to check whether the missing columns are present.
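To make "some columns disappear" measurable, a small check along these lines can count how many output records lack each expected key. This is only a sketch: the function name `count_missing_keys` and the sample records are made up for illustration, and it assumes the output is in JSON Lines form (one object per line), which is what Spark's JSON writer produces. Note that Spark's JSON writer omits fields whose value is null, so a missing key in the output usually means the value was null in the DataFrame rather than that the column was removed from the schema.

```python
import json
from collections import Counter


def count_missing_keys(lines, expected_columns):
    """Return (record_count, {column: number_of_records_lacking_it})."""
    missing = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        for col in expected_columns:
            if col not in record:
                missing[col] += 1
    return total, dict(missing)


# Hypothetical sample: the second record has no "compldistrib" key.
sample = [
    '{"compldistrib": "BAT A", "numvoie": 12}',
    '{"numvoie": 7}',
]
total, missing = count_missing_keys(sample, ["compldistrib", "numvoie"])
print(total, missing)
```

In practice the `lines` argument could be an open file handle for one of the part files downloaded from S3.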

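Is this common Glue behavior? How can I prevent it?

Edit:

I am now sure where the problem lies.

The Glue mapping seems to be the source of the problem.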

applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")

I declared compldistrib as a string and expect it as a string in the output. If a row contains a numeric value in compldistrib, the mapping ignores it!

Is this a bug?

Answer 1

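So, after several hours of searching, I did not find a solution. The workaround I found was to replace the Glue job with a Spark job on EMR. It is also much faster.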
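For anyone who has to stay on Glue, one avenue that may be worth trying (it was not verified in this thread) is to resolve the column's type before the mapping. The sketch below assumes the dropped values come from compldistrib being inferred as a mixed string/number column, and uses `DynamicFrame.resolveChoice` with a `cast:string` spec. The helper name `cast_columns_to_string` and the default `ctx` value are hypothetical.

```python
def cast_columns_to_string(dynamic_frame, columns, ctx="resolvechoice_cast"):
    """Cast the named columns to string on a Glue DynamicFrame.

    Assumption: `dynamic_frame` is an awsglue DynamicFrame whose listed
    columns may hold mixed types (for example both string and long).
    """
    # One (column, action) pair per column, as resolveChoice expects.
    specs = [(name, "cast:string") for name in columns]
    return dynamic_frame.resolveChoice(specs=specs, transformation_ctx=ctx)
```

It would be called on the frame returned by `create_dynamic_frame.from_catalog`, for example `cast_columns_to_string(etalab_named_postgre_csv, ["compldistrib"])`, with the result passed to `ApplyMapping.apply` in place of the original frame.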

I hope this helps someone.

Glue job is dropping columns when creating multiple partitions
