I have a method that gets called several times. It looks like this:
def separateGoodAndBad(myRDD: RDD[String]): RDD[String] = {
  val newRDD = myRDD.map(......) //do stuff
  newRDD.cache //newRDD has 2 actions performed on it
  val badRDD = newRDD.filter(row => row.contains("bad"))
  badRDD.count
  val goodRDD = newRDD.filter(row => row.contains("good"))
  goodRDD.count
  newRDD.unpersist // I am unpersisting because this method gets called several times
  goodRDD
}
As I said, I want to unpersist newRDD because the method is called several times and I don't want 4 different cached copies of newRDD. Here is a code sample:
val firstRDD = separateGoodAndBad(originalRDD)
val firstRDDTransformed = doStuffToFirstRDD(firstRDD)
val secondRDD = separateGoodAndBad(firstRDDTransformed)
val secondRDDTransformed = doStuffToSecondRDD(secondRDD)
val thirdRDD = separateGoodAndBad(secondRDDTransformed)
val thirdRDDTransformed = doStuffToThirdRDD(thirdRDD)
However, secondRDD and thirdRDD take longer to compute now that I have added the unpersist (see separateGoodAndBad() above). It seems they have to recompute newRDD.
When can I unpersist newRDD so that it never has to be recomputed?
Answer 0 (score: 1)
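goodRDD is computed once when you call goodRDD.count, and it is computed again when you run an action on that RDD inside the doStuffToFirstRDD method. Since newRDD has already been unpersisted by then, that second computation has to go back through newRDD as well. You may want to cache goodRDD too: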
def separateGoodAndBad(myRDD: RDD[String]): RDD[String] = {
  val newRDD = myRDD.map(......) //do stuff
  newRDD.cache //newRDD has 2 actions performed on it
  val badRDD = newRDD.filter(row => row.contains("bad"))
  badRDD.count
  val goodRDD = newRDD.filter(row => row.contains("good"))
  goodRDD.cache // this will cache goodRDD to avoid recomputing in next call
  goodRDD.count
  newRDD.unpersist // I am unpersisting because this method gets called several times
  goodRDD
}
Then you can unpersist the returned RDDs outside the function call:
val firstRDD = separateGoodAndBad(originalRDD)
val firstRDDTransformed = doStuffToFirstRDD(firstRDD)
val secondRDD = separateGoodAndBad(firstRDDTransformed)
firstRDD.unpersist //as your secondRDD will be cached by above `separateGoodAndBad` call
val secondRDDTransformed = doStuffToSecondRDD(secondRDD)
val thirdRDD = separateGoodAndBad(secondRDDTransformed)
secondRDD.unpersist //as your thirdRDD will be cached by above `separateGoodAndBad` call
val thirdRDDTransformed = doStuffToThirdRDD(thirdRDD)
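
If you want to check the recomputation on your own setup, here is a minimal sketch rather than a definitive test. It assumes Spark is on the classpath and runs in local mode. The names UnpersistDemo, separate, mapCalls and cacheGood are made up for this illustration, the map body is a stand-in for the elided "do stuff", and a plain count stands in for whatever action doStuffToFirstRDD performs. An accumulator counts how often the map function runs, so you can compare the later action with and without caching goodRDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

object UnpersistDemo {
  // Same shape as separateGoodAndBad above. The accumulator and the
  // cacheGood flag are hypothetical additions for this illustration.
  def separate(in: RDD[String], mapCalls: LongAccumulator, cacheGood: Boolean): RDD[String] = {
    // Stand-in for the elided map: count every invocation, return the row unchanged.
    val newRDD = in.map { row => mapCalls.add(1); row }
    newRDD.cache()
    newRDD.filter(row => row.contains("bad")).count()
    val goodRDD = newRDD.filter(row => row.contains("good"))
    if (cacheGood) goodRDD.cache()
    goodRDD.count() // materializes goodRDD (and stores it when it is marked for caching)
    newRDD.unpersist(blocking = true)
    goodRDD
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-demo")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(Seq("good 1", "bad 2", "good 3", "bad 4"))
      for (cacheGood <- Seq(false, true)) {
        val mapCalls = sc.longAccumulator(s"mapCalls-cacheGood-$cacheGood")
        val good = separate(input, mapCalls, cacheGood)
        val before = mapCalls.sum
        good.count() // stand-in for the action inside doStuffToFirstRDD
        val extra = mapCalls.sum - before
        println(s"cacheGood=$cacheGood: map ran $extra more time(s) during the later action")
        good.unpersist(blocking = true)
      }
    } finally {
      sc.stop()
    }
  }
}
```

With cacheGood=false the later action should run the map again for every input row, because neither newRDD nor goodRDD is cached any more. With cacheGood=true it should report no extra map calls, as long as the cached partitions of goodRDD have not been evicted. Accumulator counts inside transformations can be inflated by task retries, so treat the numbers as a diagnostic, not an exact measurement.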