My context: I have a custom Spark receiver that streams data from an HTTP endpoint. The endpoint publishes new data every 30 seconds, so for my Spark Streaming application there is no point in aggregating the full window of data over those 30 seconds; it clearly produces duplicates (when I save the DStream to files, the part files representing each RDD are identical).
To avoid this duplication, I need a 5-second slice of the window, and I want to use the slice function of the DStream API. There are two ways to call it:
1. slice(fromTime: Time, toTime: Time)
2. slice(interval: Interval)
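For reference, the relevant declarations in org.apache.spark.streaming.dstream.DStream look roughly like this (paraphrased from the Spark 2.x source):

def slice(interval: Interval): Seq[RDD[T]]            // Interval is private[streaming]
def slice(fromTime: Time, toTime: Time): Seq[RDD[T]]

Both are supposed to return the RDDs of the batches that fall between the two times.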
Although the second overload is a public method, the Interval class itself is private[streaming]. I have raised an issue on the Spark JIRA about that, but it is a separate problem (https://issues.apache.org/jira/browse/SPARK-27206).
My question is only about the first option. I do the following:
import org.apache.spark.streaming.{Duration, Durations, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream

val sparkSession = getSparkSession(APP_NAME)
val batchInterval: Duration = Durations.seconds(30)
val windowDuration: Duration = Durations.seconds(60)
val slideDuration: Duration = Durations.seconds(30)
val ssc = new StreamingContext(sparkSession.sparkContext, batchInterval)
ssc.checkpoint("some path")
// custom receiver polling the HTTP endpoint
val memTotal: ReceiverInputDStream[String] = ssc.receiverStream(new MyReceiver("http endpoint", true))
// 60-second window, sliding every 30 seconds
val dstreamMemTotal = memTotal.window(windowDuration, slideDuration)
So far, so good. However, when I add the following slice call
val a = dstreamMemTotal.slice(currentTime, currentTime + Durations.seconds(5))
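(currentTime here is an org.apache.spark.streaming.Time; to make the example self-contained, assume something like val currentTime = Time(System.currentTimeMillis()), although as far as I can tell the exact value is not what triggers the error below.)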
I get the following error:
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.dstream.WindowedDStream@62315f22 has not been initialized
    at org.apache.spark.streaming.dstream.DStream$$anonfun$slice$2.apply(DStream.scala:880)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$slice$2.apply(DStream.scala:878)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:699)
    at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:265)
    at org.apache.spark.streaming.dstream.DStream.slice(DStream.scala:878)
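My reading of DStream.scala around the lines in the trace is that slice throws this SparkException whenever the DStream's zeroTime has not been set, and zeroTime is only assigned once ssc.start() initializes the DStream graph (and only for DStreams that take part in an output operation). So calling slice while the graph is still being built, before start(), seems bound to fail. If that is right, something like the following sketch should avoid the exception (untested beyond my setup; getSparkSession and MyReceiver are my own helpers, and the plain sleep is just for illustration, a StreamingListener would be cleaner):

import org.apache.spark.streaming.Time

// register an output operation so dstreamMemTotal gets initialized on start()
dstreamMemTotal.print()
ssc.start()

// wait until at least one full window has been computed before slicing
Thread.sleep((windowDuration + batchInterval).milliseconds)

val now = Time(System.currentTimeMillis())
val fiveSecSlice = dstreamMemTotal.slice(now - Durations.seconds(5), now)
fiveSecSlice.foreach(rdd => println(s"sliced RDD with ${rdd.count()} records"))

ssc.awaitTermination()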
Is that the intended way to use slice, or am I missing something? Any pointers would be appreciated.