Pseudocode:
object myApp {
  // Must be initialized (e.g. to an empty RDD) before the first union.
  var myStaticRDD: RDD[Int] = _
  def main(args: Array[String]): Unit = {
    ... // init streaming context, and get two DStreams (streamA and streamB) from two HDFS paths
    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
      a.join(b).map(...)
    })
    // union each batch RDD of new_stream into myStaticRDD
    new_stream.foreachRDD { rdd =>
      myStaticRDD = myStaticRDD.union(rdd)
    }
    // do complex model training every two hours
    if (hour is 0, 2, 4, 6...) {
      model_training(myStaticRDD) // will take 1 hour
    }
  }
}
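I don't know how to write the code so that the model is trained every two hours, using myStaticRDD as it stands at that moment.
While model training is running, the streaming job should keep running normally at the same time, and streamA, streamB, new_stream, and myStaticRDD should continue to be updated in real time. Maybe I need to use multiple threads?
One possible solution might be: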
object myApp {
  // Must be initialized (e.g. to an empty RDD) before the first union.
  var myStaticRDD: RDD[Int] = _
  def main(args: Array[String]): Unit = {
    ... // init streaming context, and get two DStreams (streamA and streamB) from two HDFS paths
    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
      a.join(b).map(...)
    })
    // union each batch RDD of new_stream into myStaticRDD
    new_stream.foreachRDD { rdd =>
      myStaticRDD = myStaticRDD.union(rdd)
      // do complex model training every two hours
      if (hour is 0, 2, 4, 6...) {
        val tmp_rdd = myStaticRDD // RDDs are immutable, so this is a snapshot
        // start a child thread to train the model; start() is required to actually run it
        new Thread(() => model_training(tmp_rdd)).start()
      }
    }
  }
}
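
To make the snapshot-plus-background-thread idea concrete without a Spark cluster, here is a minimal sketch in plain Scala. Everything in it is an assumption for illustration: an immutable Vector[Int] stands in for myStaticRDD, onBatch stands in for the body of foreachRDD, and modelTraining is a hypothetical placeholder that just sums its input. The point it shows is that a reference to an immutable collection taken at trigger time stays fixed while later batches keep being appended.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object SnapshotTrainingSketch {
  // Stand-in for myStaticRDD (assumption): an immutable Vector behind a var.
  @volatile private var accumulated: Vector[Int] = Vector.empty

  // Dedicated single-thread pool, so training never runs on the batch thread.
  private val trainingPool = Executors.newSingleThreadExecutor()
  private implicit val trainingEc: ExecutionContext =
    ExecutionContext.fromExecutor(trainingPool)

  // Hypothetical placeholder for model_training: it only sums the snapshot.
  def modelTraining(snapshot: Vector[Int]): Long = snapshot.map(_.toLong).sum

  // Called once per micro-batch, like the function passed to foreachRDD.
  // Returns the running training job when trainNow is true.
  def onBatch(batch: Seq[Int], trainNow: Boolean): Option[Future[Long]] = {
    accumulated = accumulated ++ batch
    val snapshot = accumulated // immutable, so later batches cannot change it
    if (trainNow) Some(Future(modelTraining(snapshot))) else None
  }

  // Current accumulated data, including batches that arrived during training.
  def current: Vector[Int] = accumulated

  // Stops the training thread so the JVM can exit.
  def shutdown(): Unit = trainingPool.shutdown()
}
```

Calling onBatch(Seq(1, 2, 3), trainNow = true) followed by onBatch(Seq(4, 5), trainNow = false) gives a training result of 6, while current already holds all five elements. In a real Spark application the same shape should carry over, since jobs can be submitted from several driver threads, but the single-thread pool here also means a second training request queues behind the first instead of overlapping it.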