I've discovered that my AWS Glue job is appending duplicate data to my Data Catalog. I have a job that reads JSON, dedupes it with Spark SQL, and then tries to save it to the Data Catalog. But I must be doing something wrong, because more duplicates show up in the Data Catalog every time the job runs:
# Glue boilerplate (glueContext, spark, and job are assumed to be initialized earlier in the script)
from awsglue.dynamicframe import DynamicFrame

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://..."], "recurse": True}, format = "csv", format_options = {"withHeader": True}, transformation_ctx = "inputGDF")
inputDF = inputGDF.toDF()
print(inputDF.first())
inputDF.createOrReplaceTempView("p_places")
# Dedupe the data by id, keeping only the most recently updated row per id
filteredDF = spark.sql("""
    SELECT id, parentid, code, type, name, createdat, updatedat
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedat DESC) AS row_num,
            id, parentid, code, type, name, createdat, updatedat
        FROM p_places
    )
    WHERE row_num = 1
""")
filteredGDF = DynamicFrame.fromDF(filteredDF, glueContext, "filteredGDF")
filteredDF.createOrReplaceTempView('p_places_2')
verification = spark.sql("""
SELECT COUNT(id) FROM p_places_2 WHERE id = '12542'
""")
print("VERIFICATION:")
print(verification.first()) # Correctly outputs 1 (no dups)
outputGDF = glueContext.write_dynamic_frame.from_options(frame = filteredGDF, connection_type = "s3", connection_options = {"path": "s3://..."}, format = "parquet", transformation_ctx = "outputGDF")
job.commit()
But when I query the data with Athena, each run adds one extra duplicate row. Why is that? I suspect the write to the Parquet files always appends? How can I fix this?
Answer 0 (score: 0)
Your code only removes duplicates within the input data. Each run writes new Parquet files next to the ones already in the target path; it never updates or deletes existing files. So if you don't want duplicates in the target location, you need to load the existing data and write only the new records:
from pyspark.sql.functions import col

existingGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://..."], "recurse": True}, format = "parquet", transformation_ctx = "existingGDF")
existingDF = existingGDF.toDF()

# Left anti-join pattern: keep only the rows whose id is not already in the target
newOnlyDF = filteredDF.alias("new") \
    .join(existingDF.alias("existing"), col("new.id") == col("existing.id"), "left_outer") \
    .where(col("existing.id").isNull()) \
    .select("new.*")

newOnlyGDF = DynamicFrame.fromDF(newOnlyDF, glueContext, "newOnlyGDF")
outputGDF = glueContext.write_dynamic_frame.from_options(frame = newOnlyGDF, connection_type = "s3", connection_options = {"path": "s3://..."}, format = "parquet", transformation_ctx = "outputGDF")
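If rewriting the whole deduplicated dataset on each run is acceptable, a simpler alternative (a minimal sketch, not part of the answer above; the S3 path is the same placeholder) is to skip the DynamicFrame writer and let Spark's native writer overwrite the target path:

# Alternative sketch (assumption: rewriting the full dataset every run is acceptable).
# Spark's DataFrame writer supports mode("overwrite"), which replaces the files under
# the target path instead of appending new Parquet files on each run.
filteredDF.write.mode("overwrite").parquet("s3://...")

Note that overwrite replaces everything under the path, so this is only safe if filteredDF contains the complete deduplicated dataset, not just the latest batch.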