{
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

var history: RDD[(String, List[String])] = sc.emptyRDD
val dstream1 = ...
val dstream2 = ...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
val joined = historyDStream.join(dstream2)
... do stuff with joined as above, obtain dstreamFiltered ...
dstreamFiltered.foreachRDD{rdd =>
val formatted = rdd.map{case (k,(v1,v2)) => (k,v1) }
history.unpersist(false) // unpersist the 'old' history RDD
history = formatted // assign the new history
history.persist(StorageLevel.MEMORY_AND_DISK) // cache the computation
history.count() //action to materialize this transformation
}
This code works logically: it keeps all the RDDs that did not join successfully, saving them for future batches, so that whenever a record arrives whose key matches one held in that RDD, the join is performed. The problem is that I do not see this history being built up.
Answer (score: 1)
We can understand how the history accumulates by looking at how the RDD lineage evolves over time.
We need two pieces of background knowledge:
RDDs are immutable, so every transformation produces a new RDD.
A transformation on an RDD can be described by the function being applied plus references to its input RDDs.
Let's look at a simple example, using the classic wordCount:
val txt = sparkContext.textFile(someFile)
val words = txt.flatMap(_.split(" "))
In simple terms, txt is HadoopRDD(someFile) and words is MapPartitionsRDD(txt, flatMapFunction). We call the lineage of words the DAG (Directed Acyclic Graph) formed by this chain of operations: HadoopRDD <-- MapPartitionsRDD.
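You can ask Spark to print this lineage directly with toDebugString. A minimal sketch, reusing the wordCount snippet above (sparkContext and someFile are the same placeholders):
// Print the lineage (DAG) of the wordCount example
val txt = sparkContext.textFile(someFile)
val words = txt.flatMap(_.split(" "))
println(words.toDebugString)
// The printed tree ends at the HadoopRDD that reads someFile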
We can apply the same principle to our streaming operations:
At iteration 0 we have
var history: RDD[(String, List[String])] = sc.emptyRDD
// -> history: EmptyRDD
...
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd.union(EmptyRDD)
join, filter
// underlying RDD: rdd.union(EmptyRDD).join(otherRDD).filter(pred)
map
// -> underlying RDD: rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
history.unpersist(false)
// EmptyRDD.unpersist (does nothing, it was never persisted)
history = formatted
// history = rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
history.persist(...)
// history marked for persistence (at the next action)
history.count()
// rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f).count()
// caches the result of: rdd.union(EmptyRDD).join(otherRDD).filter(pred).map(f)
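Written as a single plain-RDD expression, the history after iteration 0 looks roughly like this sketch, where rdd0, otherRDD0 and pred are hypothetical stand-ins for the batch RDD behind dstream1, the batch RDD behind dstream2, and the filter predicate:
// Hypothetical consolidated view of the history after iteration 0
val history0 = rdd0.union(sc.emptyRDD[(String, List[String])])
  .join(otherRDD0)                        // otherRDD0: the matching batch of dstream2
  .filter(pred)                           // pred: whatever condition "do stuff" applies
  .map { case (k, (v1, v2)) => (k, v1) }  // keep only the (key, history) pairs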
At iteration 1 we have (using rdd0, rdd1 to index the iterations):
val historyDStream = dstream1.transform(rdd => rdd.union(history))
// -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f))
join, filter
// underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred)
map
// -> underlying RDD: rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f)
history.unpersist(false)
// history0.unpersist (marks the previous result for removal, we used it already for our computation above)
history = formatted
// history1 = rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f)
history.persist(...)
// new history marked for persistence (at the next action)
history.count()
// rdd1.union(rdd0.union(EmptyRDD).join(otherRDD0).filter(pred).map(f)).join(otherRDD1).filter(pred).map(f).count()
// cache the result so that we don't need to compute it next time
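The same sketch at iteration 1 makes the nesting explicit: the previous history0 becomes an input of the new lineage, so the graph grows by one union/join/filter/map layer per batch (again using the hypothetical names from the sketch above):
// Hypothetical consolidated view of the history after iteration 1
val history1 = rdd1.union(history0)
  .join(otherRDD1)
  .filter(pred)
  .map { case (k, (v1, v2)) => (k, v1) }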
This process continues, iteration after iteration.
As we can see, the graph representing the RDD computation keeps growing. cache reduces the cost of redoing all of that computation every time, but every so often we need a checkpoint to write a concrete, computed value of this growing graph, so that it can be used as a baseline without having to evaluate the whole chain.
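A minimal sketch of that checkpointing idea, assuming a checkpoint directory of your choice and a hypothetical batchCount counter kept by the driver; only the checkpoint-related lines are new, the rest mirrors the original foreachRDD body:
sc.setCheckpointDir("/tmp/history-checkpoints") // assumed path, set once at startup
var batchCount = 0L                             // hypothetical driver-side batch counter

dstreamFiltered.foreachRDD { rdd =>
  val formatted = rdd.map { case (k, (v1, v2)) => (k, v1) }
  history.unpersist(false)
  history = formatted
  history.persist(StorageLevel.MEMORY_AND_DISK)
  if (batchCount % 10 == 0) {  // every 10 batches is an arbitrary choice
    history.checkpoint()       // must be requested before the action that materializes the RDD
  }
  history.count()              // action: materializes the cache and, when requested, the checkpoint
  batchCount += 1
}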
An interesting way to observe this process is to add a line inside foreachRDD that inspects the current lineage:
...
history.unpersist(false) // unpersist the 'old' history RDD
history = formatted // assign the new history
println(history.toDebugString)
...
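With each batch the printed lineage typically gains another union/join layer; once a checkpoint of history has completed, the chain stops at the checkpointed RDD instead of reaching all the way back to the initial empty RDD.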