Merge a DStream into a global static RDD and use that static RDD periodically

Asked: 2015-05-06 12:53:55

Tags: apache-spark spark-streaming

Pseudocode:

object myApp {
  var myStaticRDD: RDD[Int]
  def main() {
    ...  // init the streaming context, and get two DStreams (streamA and streamB) from two HDFS paths

    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
        a.join(b).map(...)
      }
    )

    // merge each of new_stream's RDDs into myStaticRDD
    new_stream.foreachRDD(rdd =>
      myStaticRDD = myStaticRDD.union(rdd)
    )

    // do complex model training every two hours
    if (hour is 0, 2, 4, 6...) {
      model_training(myStaticRDD)   // will take 1 hour
    }
  }
}

I don't know how to write the code so that, every two hours, the model is trained on the snapshot of myStaticRDD as it exists at that moment.

While model training is running, the streaming job should keep running normally, and streamA, streamB, new_stream and myStaticRDD should keep updating in real time. Maybe I need to use multi-threading?
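One way to decouple training from the streaming thread is a scheduled background executor that grabs the current value of the RDD reference and trains on that snapshot. The sketch below is an assumption, not the poster's code: `myStaticRDD` and `model_training` come from the question, while the `ScheduledExecutorService` wiring and the `@volatile` annotation are suggested additions.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.rdd.RDD

object TrainingScheduler {
  // @volatile so updates made by the streaming thread are visible
  // to the training thread (a plain var gives no such guarantee)
  @volatile var myStaticRDD: RDD[Int] = _

  // model_training is the long-running routine from the question
  def model_training(data: RDD[Int]): Unit = { /* ... */ }

  def start(): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        // capture the reference once; later batches reassign
        // myStaticRDD but cannot change this snapshot
        val snapshot = myStaticRDD
        if (snapshot != null) model_training(snapshot)
      }
    }, 2, 2, TimeUnit.HOURS)   // first run after 2h, then every 2h
  }
}
```

Because RDDs are immutable, capturing the reference into a local `val` is enough to freeze the training input; the streaming side keeps reassigning `myStaticRDD` to new unions without disturbing the run in progress.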

One possible solution might be:

object myApp {
  var myStaticRDD: RDD[Int]
  def main() {
    ...  // init the streaming context, and get two DStreams (streamA and streamB) from two HDFS paths

    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
        a.join(b).map(...)
      }
    )

    // merge each of new_stream's RDDs into myStaticRDD
    new_stream.foreachRDD { rdd =>
      myStaticRDD = myStaticRDD.union(rdd)
      // do complex model training every two hours
      if (hour is 0, 2, 4, 6...) {
        val tmp_rdd = myStaticRDD   // snapshot the current RDD
        new Thread(model_training(tmp_rdd))   // start a child thread to train the model
      }
    }
  }
}
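One caveat with repeatedly calling `union` inside `foreachRDD` is that the lineage of `myStaticRDD` grows with every batch, which eventually slows down jobs and risks stack overflows. A common remedy is periodic checkpointing; the sketch below is a suggestion under the assumption that a checkpoint directory has been set via `sc.setCheckpointDir(...)`, and the cut-off interval of 10 batches is an arbitrary illustrative choice.

```scala
// Inside main(), after new_stream is built; batchCount is a driver-side
// counter (foreachRDD bodies run on the driver, so a plain var works here)
var batchCount = 0

new_stream.foreachRDD { rdd =>
  myStaticRDD = myStaticRDD.union(rdd)
  batchCount += 1
  if (batchCount % 10 == 0) {   // every 10 batches -- tune to taste
    myStaticRDD.persist()       // keep the materialized data in memory
    myStaticRDD.checkpoint()    // write to the checkpoint dir, truncating lineage
    myStaticRDD.count()         // force an action so the checkpoint happens now
  }
}
```

Without this, the two-hourly training job would carry the entire chain of unions back to the first batch in its DAG, so the checkpoint also makes each training run start from a compact, materialized dataset.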

0 Answers