Question

I have a static DataFrame with millions of rows, as shown below.

Static DataFrame:

--------------
|id|time_stamp|
--------------
|1|1540527851|
|2|1540525602|
|3|1530529187|
|4|1520529185|
|5|1510529182|
|6|1578945709|
--------------

In every batch, a streaming DataFrame is produced that contains ids and their updated time_stamp values after some operations, as shown below.

First batch:

--------------
|id|time_stamp|
--------------
|1|1540527888|
|2|1540525999|
|3|1530529784|
--------------

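After every batch, I want to update the static DataFrame with the updated values from the streaming DataFrame, as shown below. How can I do that?

Static DataFrame after the first batch: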

--------------
|id|time_stamp|
--------------
|1|1540527888|
|2|1540525999|
|3|1530529784|
|4|1520529185|
|5|1510529182|
|6|1578945709|
--------------
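
To make the desired result concrete: the update amounts to an upsert keyed on id. Below is a minimal plain-Python sketch of that semantics, using the rows from the tables above. No Spark is involved, and `upsert_by_id` is a hypothetical helper name used only for illustration.

```python
def upsert_by_id(static_rows, batch_rows):
    """Return static_rows with every row whose id appears in batch_rows replaced."""
    updated = dict(static_rows)   # id -> time_stamp
    updated.update(batch_rows)    # batch values win on matching ids
    return sorted(updated.items())


static_rows = [(1, 1540527851), (2, 1540525602), (3, 1530529187),
               (4, 1520529185), (5, 1510529182), (6, 1578945709)]
first_batch = [(1, 1540527888), (2, 1540525999), (3, 1530529784)]

# ids 1-3 take the batch values, ids 4-6 keep their original values.
result = upsert_by_id(static_rows, first_batch)
```

What I am looking for is the DataFrame equivalent of this, applied once per micro-batch.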

I have already tried except(), union(), and a 'left_anti' join, but it seems that Structured Streaming does not support these operations.

Answer 1

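I resolved this with the Spark 2.4.0 AddBatch method (the foreachBatch sink), which turns the streaming DataFrame into mini-batch DataFrames. For versions below 2.4.0 it is still a headache.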

Answer 2

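I have a similar problem. Below is the foreachBatch I applied to update the static DataFrame. I would like to know how to return the updated df that is produced inside foreachBatch.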

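from pyspark.sql import DataFrame
from pyspark.sql.streaming import StreamingQuery


def update_reference_df(df: DataFrame, static_df: DataFrame) -> StreamingQuery:
    # Run update_static_df on every micro-batch of the streaming DataFrame.
    query: StreamingQuery = df \
        .writeStream \
        .outputMode("append") \
        .format("memory") \
        .foreachBatch(lambda batch_df, batchId: update_static_df(batch_df, static_df)) \
        .start()
    return query


def update_static_df(batch_df: DataFrame, static_df: DataFrame) -> DataFrame:
    # Append the batch rows whose SITE is not yet present in static_df.
    df1: DataFrame = static_df.union(batch_df.join(static_df,
                                                 batch_df.SITE == static_df.SITE,
                                                 "left_anti"))
    # foreachBatch ignores this return value, so df1 never reaches the caller.
    return df1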

Answer 3

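As Swarup himself has already explained, if you use Spark 2.4.x you can use the foreachBatch output sink.

The sink takes a function (batchDF: DataFrame, batchId: Long) => Unit, where batchDF is the currently processed batch of the streaming DataFrame and can be used like a static DataFrame. Inside this function you can therefore update another DataFrame with the values of each batch.

See the following example. Assume you have a DataFrame called frameToBeUpdated with the same schema, held for example as an instance variable, and you want to keep your state there: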

df
  .writeStream
  .outputMode("append")
  .foreachBatch((batch: DataFrame, batchId: Long) => {
    // batch is a static DataFrame.
    // Take all rows of the original frame that are not in batch,
    // union them with the batch, then reassign the result to the
    // DataFrame you want to keep (frameToBeUpdated must be a var).
    frameToBeUpdated = batch.union(frameToBeUpdated.join(batch, Seq("id"), "left_anti"))
  })
  .start()

The update logic comes from: spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2
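
For PySpark users, the same idea can be sketched as follows. This is only an illustration under stated assumptions: `StaticFrameHolder` is a hypothetical helper class (not part of Spark), the key column is assumed to be `id`, and both frames are assumed to share the same column names. It relies on `DataFrame.join(..., how="left_anti")`, `DataFrame.unionByName`, and `DataStreamWriter.foreachBatch`, so it needs Spark 2.4 or later. Because the holder keeps the latest frame as an attribute, the updated DataFrame can be read from outside the foreachBatch function.

```python
class StaticFrameHolder:
    """Hypothetical helper: keeps the latest 'static' DataFrame between batches."""

    def __init__(self, static_df, key="id"):
        self.df = static_df
        self.key = key

    def upsert(self, batch_df, batch_id):
        # Rows of the current static frame whose key is absent from the batch.
        untouched = self.df.join(batch_df, on=self.key, how="left_anti")
        # Batch rows replace the matching rows; all other rows are kept.
        self.df = batch_df.unionByName(untouched)


# Wiring (assumes stream_df is a streaming DataFrame and static_df a batch one):
# holder = StaticFrameHolder(static_df)
# query = stream_df.writeStream.outputMode("append").foreachBatch(holder.upsert).start()
# holder.df refers to the updated frame after each micro-batch.
```

Each batch stacks another join and union onto the query plan, so a long-running job may want to cache or checkpoint the held frame from time to time; that is left out of this sketch.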

How to update a static DataFrame with a streaming DataFrame in Spark Structured Streaming
