Structured Streaming mapGroupsWithState does not work with a custom sink

Date: 2017-11-23 05:11:49

Tags: apache-spark spark-structured-streaming

Spark Structured Streaming, trying out mapGroupsWithState. Has anyone hit the case where .format("console") runs fine and prints the incremental state changes correctly, but as soon as I switch to .format("anyStreamingSinkClass"), the DataFrame received by the sink class contains only the current batch, with no memory of the state or of the incremental effect?
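
For context, the question does not show the imports or where `df` comes from; a minimal setup the snippet could run against might look like the sketch below (the socket source, host, and port are assumptions for illustration, only the identifiers are taken from the code that follows).

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import scala.concurrent.duration._  // provides the 3.seconds syntax used below

// Assumed setup, not shown in the question: a local session and a socket
// source producing "word,count" lines.
val spark = SparkSession.builder()
  .appName("mapGroupsWithStateDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()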

case class WordCount(word: String, count: Int)
case class WordInfo(totalSum: Int)
case class WordUpdate(word: String, count: Int, expired: Boolean)


val ds = df.as[String].map { x =>
  val arr = x.split(",", -1)
  WordCount(arr(0), arr(1).toInt)
}.groupByKey(_.word)
  .mapGroupsWithState[WordInfo, WordUpdate](GroupStateTimeout.NoTimeout()) {
    case (word: String, allWords: Iterator[WordCount], state: GroupState[WordInfo]) =>
      val events = allWords.toSeq
      val updatedSession = if (state.exists) {
        val existingState = state.get
        WordInfo(existingState.totalSum + events.map(event => event.count).sum)
      } else {
        WordInfo(events.map(event => event.count).sum)
      }
      state.update(updatedSession)

      WordUpdate(word, updatedSession.totalSum, false)
  }


val query = ds
  .writeStream
  //.format("console")
  .format("com.subhankar.streamDB.ConsoleSinkProvider")
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime(3.seconds))
  //.option("truncate", false)
  .option("checkpointLocation", "out.b")
  .queryName("q2090")
  .start()

query.awaitTermination()

For the sink format I get:

Batch 21's distinct count is 1
x,1
Batch 22's distinct count is 1
x,2
Batch 23's distinct count is 1
x,3

For the console format I get:

-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+-------+
|word|count|expired|
+----+-----+-------+
|   x|    1|  false|
+----+-----+-------+

-------------------------------------------
Batch: 2
-------------------------------------------
+----+-----+-------+
|word|count|expired|
+----+-----+-------+
|   x|    3|  false|
+----+-----+-------+

-------------------------------------------
Batch: 3
-------------------------------------------
+----+-----+-------+
|word|count|expired|
+----+-----+-------+
|   x|    6|  false|
+----+-----+-------+

The sink just does a simple print:

override def addBatch(batchId: Long, data: DataFrame) = {

  val batchDistinctCount = data.rdd.distinct.count()
  if (data.count() > 0) {
    println(s"Batch ${batchId}'s distinct count is ${batchDistinctCount}")
    println(data.map(x => x.getString(0) + "," + x.getInt(1)).collect().mkString(","))
  }
}
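
For reference, this addBatch presumably sits inside a Sink created by the com.subhankar.streamDB.ConsoleSinkProvider named in the writeStream above. The provider itself is not shown, so the wrapper below is a rough sketch of the standard Spark 2.x boilerplate around it, not the asker's actual class.

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Assumed wrapper: only the trait signatures are fixed by Spark, the body is a sketch.
class ConsoleSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new Sink {
    override def addBatch(batchId: Long, data: DataFrame): Unit = {
      // the printing logic shown above goes here
    }
  }
}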

1 Answer:

Answer 0 (score: 1)

I had the same problem as you.

When I tested on Spark 2.2.0, the state was reset and lost between mini-batches.

Then I tested it on Spark 2.3.0, and instead it threw an exception:

Queries with streaming sources must be executed with writeStream.start()

From that exception I found that my custom Sink contained unsupported operations.

In your case, the unsupported operation is multiple aggregations.

You have data.rdd.distinct.count(), data.count(), and data.map all within one mini-batch; that is what counts as multiple aggregations, and it is treated as unsupported.

While on Spark < 2.3 your code runs but gives wrong results, on Spark >= 2.3 it simply throws that exception.

To fix it, the following modification avoids the multiple aggregations and produces the correct results.

override def addBatch(batchId: Long, dataframe: DataFrame) = {
  // Collect once, then do everything on this local Array (beware of driver memory for large batches).
  val data = dataframe.collect()
  val batchDistinctCount = data.distinct.length
  if (data.length > 0) {
    println(s"Batch ${batchId}'s distinct count is ${batchDistinctCount}")
    println(data.map(x => x.getString(0) + "," + x.getInt(1)).mkString(","))
  }
}
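
As a side note, on Spark 2.4 and later (newer than the versions discussed above, so treat the availability as an assumption about your environment) the built-in foreachBatch sink hands each micro-batch over as a plain Dataset on which several actions are allowed, so this kind of printing needs no custom Sink at all. A minimal sketch, reusing ds, the trigger, and the options from the question:

import org.apache.spark.sql.Dataset

// Sketch only: DataStreamWriter.foreachBatch requires Spark 2.4+.
val query = ds.writeStream
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime(3.seconds))
  .option("checkpointLocation", "out.b")
  .queryName("q2090")
  .foreachBatch { (batchDF: Dataset[WordUpdate], batchId: Long) =>
    batchDF.persist()  // cache so the two actions below do not recompute the micro-batch
    val batchDistinctCount = batchDF.distinct().count()
    if (batchDF.count() > 0) {
      println(s"Batch ${batchId}'s distinct count is ${batchDistinctCount}")
      println(batchDF.collect().map(w => w.word + "," + w.count).mkString(","))
    }
    batchDF.unpersist()
  }
  .start()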