When can I unpersist my RDD?

Asked: 2017-12-01 21:28:35

Tags: scala apache-spark caching rdd

I have a method that is called several times. It looks like this:

def separateGoodAndBad(myRDD: RDD[String]): RDD[String] = {
    val newRDD = myRDD.map(......)  //do stuff
    newRDD.cache  //newRDD has 2 actions performed on it

    val badRDD = newRDD.filter(row => row.contains("bad"))
    badRDD.count

    val goodRDD = newRDD.filter(row => row.contains("good"))
    goodRDD.count

    newRDD.unpersist // I am unpersisting because this method gets called several times

    goodRDD
}

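As I said, I want to unpersist newRDD because the method is called several times and I don't want 4 different cached copies of newRDD. Here is an example of how the method is called: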

val firstRDD = separateGoodAndBad(originalRDD)
val firstRDDTransformed = doStuffToFirstRDD(firstRDD)

val secondRDD = separateGoodAndBad(firstRDDTransformed)
val secondRDDTransformed = doStuffToSecondRDD(secondRDD)

val thirdRDD = separateGoodAndBad(secondRDDTransformed)
val thirdRDDTransformed = doStuffToThirdRDD(thirdRDD)

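However, secondRDD and thirdRDD now take longer to compute since I added the unpersist (see separateGoodAndBad() above). It seems they have to recompute newRDD.

When can I unpersist newRDD so that it never has to be recomputed?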

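1 Answer:

Answer 0 (score: 1)

goodRDD is computed once when you call goodRDD.count, and it is computed again when you run an action on that RDD inside the doStuffToFirstRDD method. Because newRDD has already been unpersisted by that point, the second computation has to rebuild newRDD as well. Caching goodRDD before the count avoids this: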

    def separateGoodAndBad(myRDD: RDD[String]): RDD[String] = {
        val newRDD = myRDD.map(......)  //do stuff
        newRDD.cache  //newRDD has 2 actions performed on it

        val badRDD = newRDD.filter(row => row.contains("bad"))
        badRDD.count

        val goodRDD = newRDD.filter(row => row.contains("good"))
        goodRDD.cache    // this will cache goodRDD to avoid recomputing in next call
        goodRDD.count

        newRDD.unpersist // I am unpersisting because this method gets called several times

        goodRDD
    }
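The recomputation described above can be observed directly. The following is a minimal sketch rather than part of the original answer: it assumes a local SparkSession, a hypothetical object named `RecomputeDemo`, and a `toUpperCase` map standing in for the elided "do stuff" step. A `LongAccumulator` counts how many times the map function runs.

```scala
import org.apache.spark.sql.SparkSession

object RecomputeDemo {
  /** Returns the number of map calls seen after the first and after the second action. */
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("recompute-demo")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      val mapCalls = sc.longAccumulator("mapCalls")
      val source = sc.parallelize(Seq("good 1", "bad 2", "good 3"), numSlices = 2)

      // Assumption: toUpperCase stands in for the real, elided map logic.
      val newRDD = source.map { row => mapCalls.add(1); row.toUpperCase }
      newRDD.cache()

      // goodRDD itself is NOT cached here, as in the question's version.
      val goodRDD = newRDD.filter(_.contains("GOOD"))
      goodRDD.count() // first action: runs the map and fills the newRDD cache
      val afterFirst = mapCalls.sum

      // blocking = true so the cached blocks are gone before the next action.
      newRDD.unpersist(blocking = true)

      goodRDD.count() // second action: nothing is cached, so the map runs again
      val afterSecond = mapCalls.sum
      (afterFirst, afterSecond)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (first, second) = run()
    println(s"map calls after first action: $first, after second action: $second")
  }
}
```

In a local run without task retries, the second value should be twice the first, which is the extra work the question noticed. Adding `goodRDD.cache()` before the first `count` should keep the counter unchanged across the second action.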

You can then unpersist each returned RDD outside the function, once the next call has cached its own result:

val firstRDD = separateGoodAndBad(originalRDD)
val firstRDDTransformed = doStuffToFirstRDD(firstRDD)

val secondRDD = separateGoodAndBad(firstRDDTransformed)
firstRDD.unpersist  //as your secondRDD will be cached by above `separateGoodAndBad` call
val secondRDDTransformed = doStuffToSecondRDD(secondRDD)

val thirdRDD = separateGoodAndBad(secondRDDTransformed)
secondRDD.unpersist  //as your thirdRDD will be cached by above `separateGoodAndBad` call
val thirdRDDTransformed = doStuffToThirdRDD(thirdRDD)
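If the number of rounds grows, the same hand-off can be written as a loop. This is only a sketch under stated assumptions: `ChainedSeparation`, `runRounds`, and the `steps` parameter are hypothetical names, and `map(identity)` stands in for the elided map logic of `separateGoodAndBad`. Each round unpersists the previous good RDD only after the next one has been materialized and cached.

```scala
import org.apache.spark.rdd.RDD

object ChainedSeparation {
  // Stand-in for the answer's method; `identity` replaces the elided "do stuff" map.
  def separateGoodAndBad(myRDD: RDD[String]): RDD[String] = {
    val newRDD = myRDD.map(identity).cache()
    newRDD.filter(_.contains("bad")).count()

    val goodRDD = newRDD.filter(_.contains("good")).cache()
    goodRDD.count() // materializes goodRDD in the cache

    newRDD.unpersist(blocking = true) // safe: goodRDD no longer needs it
    goodRDD
  }

  /**
   * Applies separateGoodAndBad followed by each step in order.
   * Returns the final transformed RDD together with the last cached
   * good RDD, which the caller should unpersist when finished.
   */
  def runRounds(
      original: RDD[String],
      steps: Seq[RDD[String] => RDD[String]]
  ): (RDD[String], Option[RDD[String]]) = {
    var input = original
    var cached: Option[RDD[String]] = None
    for (step <- steps) {
      val good = separateGoodAndBad(input) // caches this round's result
      cached.foreach(_.unpersist(blocking = true)) // previous round is no longer needed
      cached = Some(good)
      input = step(good)
    }
    (input, cached)
  }
}
```

With this shape, at most two good RDDs are cached at any moment: the one being built and the one it was derived from.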