Spark RDD lifecycle: will an RDD be unpersisted once it goes out of scope?

Posted: 2015-04-23 04:00:02

Tags: scala apache-spark

In a method, I create a new RDD and cache it. Will Spark automatically unpersist the RDD once the rdd reference goes out of scope?

That is what I am assuming, but what actually happens?
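To illustrate the situation, here is a minimal, hypothetical sketch (the JavaSparkContext sc, the method name, and the filter lambda are placeholders, not code from the question):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheInMethod {
    // The local reference 'rdd' goes out of scope when this method returns;
    // the question is whether Spark then unpersists the cached data on its own.
    static long countNonEmptyLines(JavaSparkContext sc, String path) {
        JavaRDD<String> rdd = sc.textFile(path);
        rdd.cache();                                        // mark for caching
        return rdd.filter(line -> !line.isEmpty()).count(); // action materializes and caches rdd
    }
}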

1 answer:

Answer 0 (score: 3)

No, it will not be unpersisted automatically.

Why not? Because even though it may look to you as if the RDD is no longer needed, Spark's model is to not materialize an RDD until it is needed for a transformation, so it is actually very hard to tell "I won't need this RDD anymore." Even for you it can be very tricky, because of situations like the following:

JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create empty RDD for merging
for (int i = 0; i < 10; i++)
{
  JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
  rdd.cache(); // Since it will be used twice, cache it.
  rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save; rdd materializes
  rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Do another transform to T and merge by union
  rdd.unpersist(); // Now it seems no longer needed. (But it is actually still needed.)
}

// Here, rddUnion actually materializes and needs all 10 rdds that were already unpersisted,
// so all 10 rdds will be rebuilt.
rddUnion.saveAsTextFile(mergedFileName);

Credit for the code sample goes to the spark-user mailing list.
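To make the point concrete, below is a minimal sketch of the corrected ordering: keep references to the cached RDDs until rddUnion has actually been written out, and only then call unpersist(). The file paths and the trim/upper-case lambdas are placeholders standing in for the ... in the example above and are not from the original mailing-list code; only the position of unpersist() relative to the final saveAsTextFile is the point.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class UnpersistAfterMaterialize {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("unpersist-after-materialize");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical paths; adjust to real input and output locations.
        String[] inputFileNames = new String[10];
        String[] outputFileNames = new String[10];
        for (int i = 0; i < 10; i++) {
            inputFileNames[i] = "hdfs:///input/part-" + i;
            outputFileNames[i] = "hdfs:///output/part-" + i;
        }
        String mergedFileName = "hdfs:///output/merged";

        // Keep references to the cached RDDs so they can be unpersisted later.
        List<JavaRDD<String>> cached = new ArrayList<>();

        JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>()); // empty RDD for merging
        for (int i = 0; i < 10; i++) {
            JavaRDD<String> rdd = sc.textFile(inputFileNames[i]);
            rdd.cache(); // used twice below, so cache it
            cached.add(rdd);

            // First use: transform and save (this materializes and caches rdd).
            rdd.map(String::trim)
               .filter(line -> !line.isEmpty())
               .saveAsTextFile(outputFileNames[i]);

            // Second use: another transform, merged into the union (still lazy).
            rddUnion = rddUnion.union(
                rdd.map(String::toUpperCase).filter(line -> line.length() > 1));
        }

        // The union materializes here and still reads from all 10 cached RDDs.
        rddUnion.saveAsTextFile(mergedFileName);

        // Only now is it safe to unpersist without forcing a rebuild.
        for (JavaRDD<String> rdd : cached) {
            rdd.unpersist();
        }

        sc.stop();
    }
}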