How to handle duplicate data when streaming into a Databricks Delta table with Spark Structured Streaming?

Asked: 2019-10-09 12:16:42

Tags: apache-spark databricks spark-structured-streaming azure-databricks delta-lake

I am using Spark Structured Streaming with Azure Databricks Delta and writing into a Delta table (the Delta table name is "raw"). I read the data from Azure Files, where it arrives out of order, and it contains two columns, "smtUidNr" and "msgTs". I try to handle the duplicates with an upsert in my code, but when I query the Delta table "raw" I still see duplicate records like the following:

    smtUidNr                                 msgTs
    57A94ADA218547DC8AE2F3E7FB14339D    2019-08-26T08:58:46.000+0000
    57A94ADA218547DC8AE2F3E7FB14339D    2019-08-26T08:58:46.000+0000
    57A94ADA218547DC8AE2F3E7FB14339D    2019-08-26T08:58:46.000+0000

Below is my code:

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._


// merge duplicates
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {


  microBatchOutputDF.createOrReplaceTempView("updates")


  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO raw t
    USING updates s
    ON (s.smtUidNr = t.smtUidNr and s.msgTs>t.msgTs) 
    WHEN MATCHED THEN UPDATE SET * 
    WHEN NOT MATCHED THEN INSERT *
  """)
}


val df=spark.readStream.format("delta").load("abfss://abc@hjklinfo.dfs.core.windows.net/entrypacket/")
df.createOrReplaceTempView("table1")
val entrypacket_DF = spark.sql("""
  SELECT details as dcl, invdetails as inv, eventdetails as evt, smtdetails as smt,
         msgHdr.msgTs, msgHdr.msgInfSrcCd
  FROM table1
  LATERAL VIEW explode(dcl) dcl AS details
  LATERAL VIEW explode(inv) inv AS invdetails
  LATERAL VIEW explode(evt) evt AS eventdetails
  LATERAL VIEW explode(smt) smt AS smtdetails
""").dropDuplicates()


entrypacket_DF.createOrReplaceTempView("ucdx")

// Add a date_timestamp column (msgTs truncated to a date) so that duplicates within the
// same day are dropped on (smtUidNr, date_timestamp); the helper column is then removed,
// leaving the original msgTs column untouched.
val resultDF = spark.sql("select dcl.smtUidNr, dcl, inv, evt, smt, cast(msgTs as timestamp) msgTs, msgInfSrcCd from ucdx")
  .withColumn("date_timestamp", to_date(col("msgTs")))
  .dropDuplicates(Seq("smtUidNr", "date_timestamp"))
  .drop("date_timestamp")

resultDF.createOrReplaceTempView("final_tab")

val finalDF = spark.sql("""
  select distinct smtUidNr, max(dcl) as dcl, max(inv) as inv, max(evt) as evt,
         max(smt) as smt, max(msgTs) as msgTs, max(msgInfSrcCd) as msgInfSrcCd
  from final_tab
  group by smtUidNr
""")


finalDF.writeStream.format("delta").foreachBatch(upsertToDelta _).outputMode("update").start()

Does Structured Streaming not support aggregation, window functions and ORDER BY clauses? What should I change in my code so that only one record is kept for a given smtUidNr?

2 Answers:

Answer 0 (score: 0)

What you need to do is deduplicate inside the foreachBatch method, so you can be sure that each batch merge writes only one value per key.

In your example, you would do something like this:

def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {

  // Collapse the micro-batch to a single, latest row per smtUidNr before merging.
  // (The 'symbol column syntax below relies on import spark.implicits._ being in scope.)
  microBatchOutputDF
    .select('smtUidNr, struct('msgTs, 'dcl, 'inv, 'evt, 'smt, 'msgInfSrcCd).as("cols"))
    .groupBy('smtUidNr)
    .agg(max('cols).as("latest"))
    .select("smtUidNr", "latest.*")
    .createOrReplaceTempView("updates")

  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO raw t
    USING updates s
    ON (s.smtUidNr = t.smtUidNr and s.msgTs>t.msgTs) 
    WHEN MATCHED THEN UPDATE SET * 
    WHEN NOT MATCHED THEN INSERT *
  """)
}

finalDF.writeStream.foreachBatch(upsertToDelta _).outputMode("update").start()

You can see more examples in the documentation here.
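
For intuition, here is a minimal, self-contained sketch of why this works (made-up sample rows; assumes a Databricks notebook or spark-shell where spark is in scope). Structs are compared field by field, so with msgTs as the first field, max(struct(...)) returns the whole row that has the latest msgTs for each key:

import org.apache.spark.sql.functions._
import spark.implicits._

// two rows for the same smtUidNr, the second one newer
val sample = Seq(
  ("57A94ADA218547DC8AE2F3E7FB14339D", "2019-08-26T08:58:46", "older payload"),
  ("57A94ADA218547DC8AE2F3E7FB14339D", "2019-08-26T09:10:00", "newer payload")
).toDF("smtUidNr", "msgTs", "payload")

sample
  .select($"smtUidNr", struct($"msgTs", $"payload").as("cols"))
  .groupBy($"smtUidNr")
  .agg(max($"cols").as("latest"))   // struct ordering compares msgTs first
  .select("smtUidNr", "latest.*")
  .show(false)                      // one row per smtUidNr, carrying the newest payload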

Answer 1 (score: 0)

The following snippet helps you find the latest record when there are multiple rows with the same unique id. It also picks just one row when several rows are exactly identical.

Let the unique key you use to filter rows/records be 'id', and assume you have a 'timestamp' column that can be used to find the latest record for a given id.

from delta.tables import DeltaTable
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

def upsertToDelta(micro_batch_df, batchId):
    # `database` and `table_name` are placeholders for your target Delta table.
    delta_table = DeltaTable.forName(spark, f'{database}.{table_name}')
    # Drop exact duplicates, then keep only the newest row per id
    # (row_number guarantees a single row per id even when timestamps tie).
    w = Window.partitionBy('id').orderBy(col('timestamp').desc())
    df = micro_batch_df.dropDuplicates() \
        .withColumn("r", row_number().over(w)) \
        .filter("r == 1").drop("r")
    delta_table.alias("t") \
        .merge(df.alias("s"), 's.id = t.id') \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

final_df.writeStream \
    .foreachBatch(upsertToDelta) \
    .option('checkpointLocation', '/mnt/path/checkpoint') \
    .outputMode('update') \
    .start()
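
Applied back to the Scala code in the question, a rough sketch of the same approach with the Delta Lake Scala API could look like the following. Assumptions: the target Delta table is registered as raw, the columns are those from the question, upsertLatestToDelta is just an illustrative name, and the checkpoint path is the placeholder used above.

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def upsertLatestToDelta(microBatchOutputDF: DataFrame, batchId: Long): Unit = {
  // Keep only the newest record per smtUidNr within this micro-batch.
  val latestPerKey = microBatchOutputDF
    .withColumn("rn", row_number().over(
      Window.partitionBy("smtUidNr").orderBy(col("msgTs").desc)))
    .filter(col("rn") === 1)
    .drop("rn")

  // Merge on the key only, and update the target row only when the incoming record is newer.
  DeltaTable.forName("raw").as("t")
    .merge(latestPerKey.as("s"), "s.smtUidNr = t.smtUidNr")
    .whenMatched("s.msgTs > t.msgTs").updateAll()
    .whenNotMatched().insertAll()
    .execute()
}

finalDF.writeStream
  .foreachBatch(upsertLatestToDelta _)
  .option("checkpointLocation", "/mnt/path/checkpoint")
  .outputMode("update")
  .start()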