Restart a streaming query without stopping the application

Time: 2017-08-10 15:30:04

Tags: apache-spark spark-streaming

I am trying to restart a streaming query in Spark using the code below (around query.awaitTermination()). The code sits inside an infinite loop that watches for a trigger to restart the query and then executes the snippet below. Basically, I am trying to refresh a cached DataFrame.

    query.processAllAvailable()
    query.stop()
    // oldDF is a cached DataFrame created from a GlobalTempView and is about 150 GB in size
    oldDF.unpersist()
    val inputDf: DataFrame = readFile(spec, sparkSession) // read the file from S3 or any other source
    val recreateddf = inputDf.persist()
    // Start the query // should I start the query again here by invoking readStream?
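
For illustration, a minimal sketch of the restart loop described above; readFile(spec, sparkSession), buildStreamingDf(...) and shouldRefresh() are hypothetical placeholders, and since a stopped query cannot be resumed, a new query is started from writeStream each time:

    var oldDF = readFile(spec, sparkSession).persist()
    var query = buildStreamingDf(sparkSession, oldDF)   // streaming plan joined with the cached df
      .writeStream
      .format("console")
      .start()

    while (true) {
      if (shouldRefresh()) {            // external trigger, e.g. a control file or a timer
        query.processAllAvailable()     // drain data that has already arrived
        query.stop()                    // blocks until the query threads have stopped
        oldDF.unpersist()
        oldDF = readFile(spec, sparkSession).persist()  // refresh the cached DataFrame
        query = buildStreamingDf(sparkSession, oldDF)   // rebuild the plan on the new df
          .writeStream
          .format("console")
          .start()
      }
      Thread.sleep(60 * 1000)
    }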

But when I look at the Spark documentation, it says:

void processAllAvailable(): Blocks until all available data in the source has been processed and committed to the sink. This method is intended for testing. Note that in the case of continually arriving data, this method may block forever. Additionally, this method is only guaranteed to block until data that has been synchronously appended to a Source prior to invocation (i.e. getOffset must immediately reflect the addition).


stop(): Stops the execution of this query if it is running. This method blocks until the threads performing execution have stopped.

The main question: is there a better way to restart the query without stopping my Spark Streaming application?

1 Answer:

Answer 0 (score: 0)

This worked for me.

Below is the scenario I followed in Spark 2.4.5 for left outer joins and left joins. The process below pushes Spark to read the latest dimension data changes.

The process is for a stream join with a batch dimension (which is always being updated):

Step 1: Before starting the Spark streaming job, make sure the dimension batch-data folder has only one file, and that file should have at least one record (for some reason, placing an empty file does not work).
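
As a quick sanity check for this precondition, a small sketch assuming the same dimension folder used in the example code below:

import java.io.File

val dimDir = new File("src/main/resources/broadcasttest/dimension")
// ignore marker files such as _SUCCESS; handle a missing folder gracefully
val dataFiles = Option(dimDir.listFiles()).getOrElse(Array.empty[File])
  .filter(f => f.isFile && !f.getName.startsWith("_"))

require(dataFiles.length == 1,
  s"expected exactly one dimension file, found ${dataFiles.length}")
require(dataFiles.head.length() > 0,
  "the dimension file must not be empty (at least one record is needed)")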

Step 2: Start your streaming job and add streaming records to the Kafka stream.

Step 3: Overwrite the dimension data with new values (the file should keep the same name, do not change it, and the dimension folder should still have only one file). Note: do not use Spark to write to this folder; use Java or Scala file-system I/O to overwrite the file, or use bash to delete the file and replace it with a new data file of the same name.
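
For example, one way to do the Step 3 overwrite from Scala with java.nio.file instead of Spark; the file name dim.csv and the replacement rows are assumptions for illustration:

import java.nio.file.{Files, Paths, StandardCopyOption}
import java.nio.charset.StandardCharsets

// assumed file name: keep whatever name the existing dimension file already has
val dimFile = Paths.get("src/main/resources/broadcasttest/dimension/dim.csv")
// stage the new content outside the dimension folder so the folder never holds
// more than one file, then move it over the old file under the same name
val tmpFile = Paths.get("src/main/resources/broadcasttest/dim.csv.tmp")

val newRows =
  """id,countrycode,countryname,timestamp_column_fin_2
    |1,IN,India,2020-05-01 00:00:00
    |2,US,United States,2020-05-01 00:00:00""".stripMargin

Files.write(tmpFile, newRows.getBytes(StandardCharsets.UTF_8))
Files.move(tmpFile, dimFile, StandardCopyOption.REPLACE_EXISTING)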

Step 4: In the next batch, Spark is able to read the updated dimension data while joining with the Kafka stream.

Example code:

package com.databroccoli.streaming.streamjoinupdate

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}
import org.apache.spark.sql.{DataFrame, SparkSession}

object BroadCastStreamJoin3 {

  def main(args: Array[String]): Unit = {
    @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

    Logger.getLogger("akka").setLevel(Level.WARN)
    Logger.getLogger("org").setLevel(Level.ERROR)
    Logger.getLogger("com.amazonaws").setLevel(Level.ERROR)
    Logger.getLogger("com.amazon.ws").setLevel(Level.ERROR)
    Logger.getLogger("io.netty").setLevel(Level.ERROR)

    val spark = SparkSession
      .builder()
      .master("local")
      .getOrCreate()

    val schemaUntyped1 = StructType(
      Array(
        StructField("id", StringType),
        StructField("customrid", StringType),
        StructField("customername", StringType),
        StructField("countrycode", StringType),
        StructField("timestamp_column_fin_1", TimestampType)
      ))

    val schemaUntyped2 = StructType(
      Array(
        StructField("id", StringType),
        StructField("countrycode", StringType),
        StructField("countryname", StringType),
        StructField("timestamp_column_fin_2", TimestampType)
      ))

    // streaming fact data: every micro-batch picks up new CSV files from the fact folder
    val factDf1 = spark.readStream
      .schema(schemaUntyped1)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/fact")


    // static (batch) dimension data: the batch side of the stream-static join is
    // re-evaluated each micro-batch, which is what picks up the overwritten file
    val dimDf3 = spark.read
      .schema(schemaUntyped2)
      .option("header", "true")
      .csv("src/main/resources/broadcasttest/dimension")
      .withColumnRenamed("id", "id_2")
      .withColumnRenamed("countrycode", "countrycode_2")

    import spark.implicits._

    factDf1
      .join(
        dimDf3,
        $"countrycode_2" <=> $"countrycode",
        "inner"
      )
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination

  }
}