Spark Streaming from existing RDDs

Date: 2016-01-29 15:31:40

Tags: java hadoop apache-spark spark-streaming

Can anyone help me figure out how to create a DStream from existing RDDs? My code is:

sets = [set(l) for l in ll]  # ll is a list of lists, defined elsewhere
N = len(sets)                # pairwise unions over all index pairs i < j
res = [[list(sets[i] | sets[j]) for j in range(i + 1, N)] for i in range(N)]

Now I need to use these RDDs as input to a JavaStreamingContext.

1 answer:

Answer 0 (score: 2):

Try the queueStream API, which takes a queue of RDDs as the stream source: each RDD pushed into the queue is treated as one batch of data in the DStream and processed like a stream.

public <T> InputDStream<T> queueStream(scala.collection.mutable.Queue<RDD<T>> queue,
                              boolean oneAtATime,
                              scala.reflect.ClassTag<T> evidence$15)

Create an input stream from a queue of RDDs. In each batch, it will process either one or all of the RDDs returned by the queue.
NOTE: Arbitrary RDDs can be added to queueStream, there is no way to recover data of those RDDs, so queueStream doesn't support checkpointing.
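Putting this together in Java, a minimal sketch might look like the following. The class name, the local[2] master, the 1-second batch interval, and the sample integer data are all assumptions for illustration; substitute your own RDDs for the parallelized lists.

import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setMaster("local[2]")
                .setAppName("QueueStreamExample");
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(1));
        JavaSparkContext jsc = jssc.sparkContext();

        // Build a queue of existing RDDs; each one becomes a batch of the DStream.
        Queue<JavaRDD<Integer>> rddQueue = new LinkedList<>();
        for (int i = 0; i < 3; i++) {
            rddQueue.add(jsc.parallelize(Arrays.asList(i, i + 1, i + 2)));
        }

        // oneAtATime = true: consume one RDD from the queue per batch interval.
        JavaDStream<Integer> stream = jssc.queueStream(rddQueue, true);
        stream.print();

        jssc.start();
        jssc.awaitTerminationOrTimeout(5000);
        jssc.stop();
    }
}

With oneAtATime set to true, one RDD is dequeued per batch interval; set it to false to process all queued RDDs in a single batch. As the quoted note says, queueStream does not support checkpointing, so it is mainly useful for testing and for replaying pre-computed RDDs.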