Spark Structured Streaming: multiple queries do not run concurrently

Time: 2017-07-26 15:52:10

Tags: scala apache-spark spark-streaming

I slightly modified the example from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala

I added a second writeStream (sink):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

case class MyWriter1() extends ForeachWriter[Row]{
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }

  override def close(errorOrNull: Throwable): Unit = {}
}

case class MyWriter2() extends ForeachWriter[(String, Int)]{
  override def open(partitionId: Long, version: Long): Boolean = true

  override def process(value: (String, Int)): Unit = {
    println(s"custom2 - $value")
  }

  override def close(errorOrNull: Throwable): Unit = {}
}


object Main extends Serializable{

  def main(args: Array[String]): Unit = {
    println("starting")

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val host = "localhost"
    val port = "9999"

    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("app-test")
      .getOrCreate()

    import spark.implicits._

    // Create DataFrame representing the stream of input lines from connection to host:port
    val lines = spark.readStream
      .format("socket")
      .option("host", host)
      .option("port", port)
      .load()

    // Split the lines into words
    val words = lines.as[String].flatMap(_.split(" "))

    // Generate running word count
    val wordCounts = words.groupBy("value").count()

    // Start running the query that prints the running counts to the console
    val query1 = wordCounts.writeStream
      .outputMode("update")
      .foreach(MyWriter1())
      .start()

    // groupBy().count() yields a Long column, so read it as Long before narrowing to Int
    val ds = wordCounts.map(x => (x.getAs[String]("value"), x.getAs[Long]("count").toInt))

    val query2 = ds.writeStream
      .outputMode("update")
      .foreach(MyWriter2())
      .start()

    spark.streams.awaitAnyTermination()

  }
}

Unfortunately, only the first query runs; the second one never does (MyWriter2 is never called).

Please advise what I am doing wrong. According to the docs: you can start any number of queries in a single SparkSession. They will all run concurrently, sharing the cluster resources.

3 Answers:

Answer 0 (score: 1)

Are you using nc -lk 9999 to send data to Spark? Each query creates its own connection to nc, but nc can only send its data to the first connection (the first query). You could write a small TCP server instead of using nc.
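Since nc hands its input to only one accepted connection, a tiny replacement server that broadcasts every stdin line to all connected clients is enough to feed both queries. Here is a minimal sketch (the port number and the absence of error handling are illustrative assumptions, not a production design):

import java.io.PrintWriter
import java.net.ServerSocket
import scala.collection.mutable.ArrayBuffer
import scala.io.StdIn

// Broadcast each stdin line to every connected client, so that both
// streaming queries (each of which opens its own socket connection)
// receive the same data.
object BroadcastServer {
  def main(args: Array[String]): Unit = {
    val server  = new ServerSocket(9999)
    val clients = ArrayBuffer.empty[PrintWriter]

    // Accept connections on a background thread; query1 and query2 each connect once.
    new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          val socket = server.accept()
          clients.synchronized {
            clients += new PrintWriter(socket.getOutputStream, true) // autoFlush on println
          }
        }
      }
    }).start()

    // Fan stdin out to all connected clients until EOF.
    Iterator.continually(StdIn.readLine()).takeWhile(_ != null).foreach { line =>
      clients.synchronized { clients.foreach(_.println(line)) }
    }
  }
}

Run it in place of nc -lk 9999 and type input lines into its console; both MyWriter1 and MyWriter2 should then print.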

Answer 1 (score: 1)

I ran into the same situation (but on the newer Structured Streaming API), and in my case it helped to call awaitTermination() on the last StreamingQuery.

Something like:

query1.start()
query2.start().awaitTermination()

Update: Instead of the above, this built-in solution/method is better:

sparkSession.streams.awaitAnyTermination()

Answer 2 (score: 0)

Your use of .awaitAnyTermination() will terminate the application as soon as the first stream returns; you have to wait for both streams to finish before terminating.

Something like this should do the trick:

 query1.awaitTermination()
 query2.awaitTermination()
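If you would rather keep the built-in manager, a hedged alternative is to loop until no query remains active. This sketch assumes spark is the active SparkSession and uses resetTerminated() so that each iteration blocks again:

// Wait until every query has stopped, not just the first one.
// awaitAnyTermination() returns (or rethrows a query's failure) as soon
// as one query terminates; resetTerminated() clears that state so the
// next call blocks for the remaining queries.
while (spark.streams.active.nonEmpty) {
  try {
    spark.streams.awaitAnyTermination()
  } finally {
    spark.streams.resetTerminated()
  }
}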