Spark Structured Streaming, trying mapGroupsWithState. Has anyone run into this: with .format("console") everything runs perfectly and the incremental state changes are printed as expected, but whenever I switch to .format("anyStreamingSinkClass"), the DataFrame my sink class receives only contains the current batch, with no state or incremental effect at all.
// Imports assumed for this snippet; `df` is the streaming source DataFrame (not shown)
// and `spark` is the active SparkSession.
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import scala.concurrent.duration._
import spark.implicits._

case class WordCount(word: String, count: Int)
case class WordInfo(totalSum: Int)
case class WordUpdate(word: String, count: Int, expired: Boolean)

val ds = df.as[String].map { x =>
    val arr = x.split(",", -1)
    WordCount(arr(0), arr(1).toInt)
  }
  .groupByKey(_.word)
  .mapGroupsWithState[WordInfo, WordUpdate](GroupStateTimeout.NoTimeout()) {
    case (word: String, allWords: Iterator[WordCount], state: GroupState[WordInfo]) =>
      val events = allWords.toSeq
      // Add this batch's counts to the running total kept in the state, if any.
      val updatedSession = if (state.exists) {
        val existingState = state.get
        WordInfo(existingState.totalSum + events.map(event => event.count).sum)
      } else {
        WordInfo(events.map(event => event.count).sum)
      }
      state.update(updatedSession)
      WordUpdate(word, updatedSession.totalSum, expired = false)
  }

val query = ds
  .writeStream
  //.format("console")
  .format("com.subhankar.streamDB.ConsoleSinkProvider")
  .outputMode(OutputMode.Update())
  .trigger(Trigger.ProcessingTime(3.seconds))
  //.option("truncate", false)
  .option("checkpointLocation", "out.b")
  .queryName("q2090")
  .start()

query.awaitTermination()
With my sink format I get:

Batch 21's distinct count is 1
x,1
Batch 22's distinct count is 1
x,2
Batch 23's distinct count is 1
x,3
With the console format I get:
-------------------------------------------
Batch: 1
-------------------------------------------
+----+-----+-------+
|word|count|expired|
+----+-----+-------+
| x| 1| false|
+----+-----+-------+
-------------------------------------------
Batch: 2
-------------------------------------------
+----+-----+-------+
|word|count|expired|
+----+-----+-------+
| x| 3| false|
+----+-----+-------+
-------------------------------------------
Batch: 3
-------------------------------------------
+----+-----+-------+
|word|count|expired|
+----+-----+-------+
| x| 6| false|
+----+-----+-------+
The sink just does a simple print...
override def addBatch(batchId: Long, data: DataFrame) = {
  val batchDistinctCount = data.rdd.distinct.count()
  if (data.count() > 0) {
    println(s"Batch ${batchId}'s distinct count is ${batchDistinctCount}")
    println(data.map(x => x.getString(0) + "," + x.getInt(1)).collect().mkString(","))
  }
}
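For context, a sink like this is normally wired up through Spark 2.x's (internal) StreamSinkProvider API. Below is a minimal sketch of what com.subhankar.streamDB.ConsoleSinkProvider might look like; the class and package names are taken from the .format(...) call above, the rest is assumed:

package com.subhankar.streamDB

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

// Provider class that .format("com.subhankar.streamDB.ConsoleSinkProvider") points at;
// Spark calls createSink once when the query is started.
class ConsoleSinkProvider extends StreamSinkProvider {
  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new ConsoleSink
}

// The Sink then receives one micro-batch at a time through addBatch,
// which is where the printing code from the question lives.
class ConsoleSink extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // ... addBatch body shown above ...
  }
}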
Answer (score: 1)
I had exactly the same problem as you.
When I tested on Spark 2.2.0, the state was reset and lost between every micro-batch.
When I then tested on Spark 2.3.0, it threw an exception instead:
Queries with streaming sources must be executed with writeStream.start()
That exception is what led me to find the unsupported operations in my custom Sink.
In your case, the unsupported operation is multiple aggregations: within a single micro-batch you call data.rdd.distinct.count(), data.count(), and data.map(...), which counts as multiple aggregations and is treated as unsupported.
While on Spark < 2.3 your code runs and silently produces wrong results, on Spark >= 2.3 it simply throws this exception.
To fix it, the following modification avoids the multiple aggregations and produces the correct result.
override def addBatch(batchId: Long, dataframe: DataFrame): Unit = {
  // Materialize the micro-batch once; everything below works on this local Array
  // (be careful about OUT OF MEMORY for large batches).
  val data = dataframe.collect()
  val batchDistinctCount = data.distinct.length
  if (data.length > 0) {
    println(s"Batch ${batchId}'s distinct count is ${batchDistinctCount}")
    println(data.map(x => x.getString(0) + "," + x.getInt(1)).mkString(","))
  }
}
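If collecting the whole micro-batch to the driver is a concern for larger batches, another workaround that is sometimes used (a hedged sketch, not something I tested here) is to copy the streaming micro-batch into a plain batch DataFrame first, so that running several actions on it no longer hits the streaming restrictions:

override def addBatch(batchId: Long, data: DataFrame): Unit = {
  // Re-create the micro-batch as a regular (non-streaming) DataFrame so that
  // multiple actions on it are ordinary batch operations.
  val spark = data.sparkSession
  val batchDF = spark.createDataFrame(data.rdd, data.schema)
  batchDF.cache()

  val batchDistinctCount = batchDF.distinct().count()
  if (batchDF.count() > 0) {
    println(s"Batch ${batchId}'s distinct count is ${batchDistinctCount}")
    println(batchDF.collect().map(x => x.getString(0) + "," + x.getInt(1)).mkString(","))
  }
  batchDF.unpersist()
}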