I've discovered that my AWS Glue job is appending duplicate data to my Data Catalog. I have a job that reads JSON, dedupes it with Spark SQL, and then tries to save it to the Data Catalog. But I must be doing something wrong, because more duplicates show up in the Data Catalog every time the job runs:
# Glue boilerplate (glueContext, spark, and job are assumed to be initialized earlier in the script)
from awsglue.dynamicframe import DynamicFrame

inputGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://..."], "recurse": True}, format = "csv", format_options = {"withHeader": True}, transformation_ctx = "inputGDF")
inputDF = inputGDF.toDF()
print(inputDF.first())
inputDF.createOrReplaceTempView("p_places")
# Dedupe the data by id, keeping only the most recently updated row per id
filteredDF = spark.sql("""
    SELECT id, parentid, code, type, name, createdat, updatedat
    FROM (
        SELECT
            ROW_NUMBER() OVER (PARTITION BY id ORDER BY updatedat DESC) AS row_num,
            id, parentid, code, type, name, createdat, updatedat
        FROM p_places
    )
    WHERE row_num = 1
""")
filteredGDF = DynamicFrame.fromDF(filteredDF, glueContext, "filteredGDF")
filteredDF.createOrReplaceTempView('p_places_2')
verification = spark.sql("""
SELECT COUNT(id) FROM p_places_2 WHERE id = '12542'
""")
print("VERIFICATION:")
print(verification.first()) # Correctly outputs 1 (no dups)
outputGDF = glueContext.write_dynamic_frame.from_options(frame = filteredGDF, connection_type = "s3", connection_options = {"path": "s3://..."}, format = "parquet", transformation_ctx = "outputGDF")
job.commit()
But when I query the data with Athena, each run adds one extra duplicate row. Why is that? I suspect the write to the Parquet files always appends? How can I fix this?
Answer 0 (score: 0)
Your code only removes duplicates within the input data. Each run writes new Parquet files next to the ones already in the target path; it never updates or deletes existing files. So if you don't want duplicates in the target location, you need to load the existing data and write only the new records:
from pyspark.sql.functions import col

existingGDF = glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": ["s3://..."], "recurse": True}, format = "parquet", transformation_ctx = "existingGDF")
existingDF = existingGDF.toDF()

# Left anti-join pattern: keep only the rows whose id is not already in the target
newOnlyDF = filteredDF.alias("new") \
    .join(existingDF.alias("existing"), col("new.id") == col("existing.id"), "left_outer") \
    .where(col("existing.id").isNull()) \
    .select("new.*")

newOnlyGDF = DynamicFrame.fromDF(newOnlyDF, glueContext, "newOnlyGDF")
outputGDF = glueContext.write_dynamic_frame.from_options(frame = newOnlyGDF, connection_type = "s3", connection_options = {"path": "s3://..."}, format = "parquet", transformation_ctx = "outputGDF")
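If rewriting the whole deduplicated dataset on each run is acceptable, a simpler alternative (a minimal sketch, not part of the answer above; the S3 path is the same placeholder) is to skip the DynamicFrame writer and let Spark's native writer overwrite the target path:

# Alternative sketch (assumption: rewriting the full dataset every run is acceptable).
# Spark's DataFrame writer supports mode("overwrite"), which replaces the files under
# the target path instead of appending new Parquet files on each run.
filteredDF.write.mode("overwrite").parquet("s3://...")

Note that overwrite replaces everything under the path, so this is only safe if filteredDF contains the complete deduplicated dataset, not just the latest batch.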