Question

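I have two data streams that I join once every minute. Records that are not joined should be joined with the next batch of records. This means that each iteration produces two outputs: the joined records and the unmatched records. The joined records are saved to a directory, and the unmatched records are used in the next iteration.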
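The per-iteration logic described above can be sketched with plain Scala collections, independent of Spark. This is only an illustration of the intended behaviour, not the Spark Streaming implementation: the names `CarryOverJoin`, `Batch`, `Step`, `step` and `run` are hypothetical, and each batch is assumed to be a small `Map` keyed by the join key.

```scala
// Hypothetical model of the intended behaviour; none of these names come from Spark.
object CarryOverJoin {
  // Assumption: a batch is a map from join key to value.
  type Batch = Map[String, String]

  // Output of one iteration: the joined records plus the leftovers of each side.
  final case class Step(
      joined: Map[String, (String, String)],
      unmatched1: Batch,
      unmatched2: Batch
  )

  // Join the new batches together with the leftovers of the previous iteration.
  def step(new1: Batch, new2: Batch, prev1: Batch, prev2: Batch): Step = {
    val left = prev1 ++ new1   // on duplicate keys the newer value wins
    val right = prev2 ++ new2
    val joined = left.collect {
      case (k, v1) if right.contains(k) => (k, (v1, right(k)))
    }
    // Whatever did not join is carried over to the next iteration.
    Step(joined, left -- joined.keys, right -- joined.keys)
  }

  // Thread the unmatched records through a sequence of (stream1, stream2) batches.
  def run(batches: Seq[(Batch, Batch)]): Seq[Step] = {
    val empty = Step(Map.empty, Map.empty, Map.empty)
    val all = batches.scanLeft(empty) {
      case (prev, (b1, b2)) => step(b1, b2, prev.unmatched1, prev.unmatched2)
    }
    all.drop(1) // drop the artificial initial state
  }
}
```

The point of the sketch is that the unmatched records are explicit state passed from one iteration to the next. In Spark Streaming, state of this kind is usually carried across batches with a stateful operation such as `updateStateByKey` or `mapWithState`; whether either fits this particular job is an assumption that would need checking.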

I am caching it, but in the next iteration that RDD is empty.

I tried to cache the RDD and use it in the next iteration.

def getUnmatchedRecord(stream1: DStream[(String,String)],stream2: DStream[(String,String)],joinedStream:DStream[(String,String)]):(DStream[(String,String)],DStream[(String,String)])={

    val unmathcedStream1=stream1.leftOuterJoin(joinedStream).filter(x=>x._2._2==None).map(x=>(x._1,x._2._1))
    unmathcedStream1.foreachRDD(x=>{
      println(x.cache().count())
    })

    val unmathcedStream2=stream2.leftOuterJoin(joinedStream).filter(x=>x._2._2==None).map(x=>(x._1,x._2._1))
    unmathcedStream2.foreachRDD(x=>{
      println(x.cache().count())
    })
    (unmathcedStream1,unmathcedStream2)
  }

def main(args: Array[String]): Unit = {
//
//

val t=ssc.queueStream(new mutable.Queue[RDD[(String,String)]])
var unmatchedRecord=getUnmatchedRecord(t,t,t)
val stream1 = ssc.textFileStream("").join(unmatchedRecord._1)
val stream2 = ssc.textFileStream("").join(unmatchedRecord._2)

val finalResult=stream1.join(stream2)

unmatchedRecord=getUnmatchedRecord(stream1 ,stream2,finalResult)

How to save a DStream from the previous batch and use it in the next batch: Spark Streaming

0 answers: