Question

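This question is a follow-up to my earlier question, What happens if I cache the same RDD twice in Spark.

When cache() is called on an RDD, is the state of the RDD changed (with the returned RDD being just this, for ease of use), or is a new RDD created that wraps the existing one?

What will happen in the following code: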

// Init
JavaRDD<String> a = ... // some initialise and calculation functions.
JavaRDD<String> b = a.cache();
JavaRDD<String> c = b.cache();

// Case 1, will 'a' be calculated twice in this case 
// because it's before the cache layer:
a.saveAsTextFile(somePath);
a.saveAsTextFile(somePath);

// Case 2, will the data of the calculation of 'a' 
// be cached in the memory twice in this case
// (once as 'b' and once as 'c'):
c.saveAsTextFile(somePath);

Answer 1

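When cache() is called on an RDD, is the state of the RDD changed (with the returned RDD being just this, for ease of use), or is a new RDD created that wraps the existing one?

The same RDD is returned: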

/**
 * Mark this RDD for persisting using the specified level.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

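Caching doesn't cause any side effect on said RDD. If it is already marked for persistence, nothing happens. If it isn't, the only side effect is registering it with the SparkContext, so the side effect is on the context rather than on the RDD itself.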
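To make the control flow of the quoted persist() easier to follow, here is a toy Java model of it. This is only a sketch: `ToyRdd`, `Level`, and the `registrations` counter are made-up stand-ins and not Spark classes, and the `allowOverride` flag is left out for brevity.

```java
// Toy model only: none of these types are Spark classes.
// They mimic the branching of the persist() method quoted above.
public class PersistSketch {
    enum Level { NONE, MEMORY_ONLY, DISK_ONLY }

    static class ToyRdd {
        private Level storageLevel = Level.NONE;
        // Hypothetical counter standing in for the one-time registration with the context.
        int registrations = 0;

        ToyRdd persist(Level newLevel) {
            // Changing an already assigned level is rejected.
            if (storageLevel != Level.NONE && newLevel != storageLevel) {
                throw new UnsupportedOperationException(
                    "Cannot change storage level of an RDD after it was already assigned a level");
            }
            // Register only the first time the RDD is marked for persisting.
            if (storageLevel == Level.NONE) {
                registrations++;
            }
            storageLevel = newLevel;
            return this; // the same instance comes back, no wrapper is created
        }

        ToyRdd cache() {
            return persist(Level.MEMORY_ONLY);
        }
    }

    public static void main(String[] args) {
        ToyRdd a = new ToyRdd();
        ToyRdd b = a.cache();
        ToyRdd c = b.cache(); // second call takes neither branch
        System.out.println("same instance: " + (a == b && b == c));
        System.out.println("registrations: " + a.registrations);
        try {
            a.persist(Level.DISK_ONLY);
        } catch (UnsupportedOperationException e) {
            System.out.println("level change rejected");
        }
    }
}
```

In this model, caching twice with the same level returns the same object and registers it once, while asking for a different level afterwards throws.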

Edit

Looking at JavaRDD.cache, it seems that the underlying call will cause the allocation of another JavaRDD:

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): JavaRDD[T] = wrapRDD(rdd.cache())

Where wrapRDD calls JavaRDD.fromRDD:

object JavaRDD {
  implicit def fromRDD[T: ClassTag](rdd: RDD[T]): JavaRDD[T] = new JavaRDD[T](rdd)
  implicit def toRDD[T](rdd: JavaRDD[T]): RDD[T] = rdd.rdd
}

This will cause the allocation of a new JavaRDD. That said, the internal instance of RDD[T] will remain the same.
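The difference between the wrapper and the wrapped instance can be illustrated with a small stand-alone Java sketch. `InnerRdd` and `ToyJavaRdd` are hypothetical stand-ins for RDD[T] and JavaRDD[T]; they only model the wrapping and do no caching.

```java
// Toy model only: InnerRdd and ToyJavaRdd are illustrative stand-ins, not Spark classes.
public class WrapperSketch {
    // Plays the role of the underlying RDD[T]: cache() hands back the same instance.
    static class InnerRdd {
        InnerRdd cache() {
            return this;
        }
    }

    // Plays the role of JavaRDD[T]: cache() wraps the inner result in a new object.
    static class ToyJavaRdd {
        private final InnerRdd rdd;

        ToyJavaRdd(InnerRdd rdd) {
            this.rdd = rdd;
        }

        InnerRdd rdd() {
            return rdd;
        }

        ToyJavaRdd cache() {
            return new ToyJavaRdd(rdd.cache());
        }
    }

    public static void main(String[] args) {
        ToyJavaRdd a = new ToyJavaRdd(new InnerRdd());
        ToyJavaRdd b = a.cache();
        ToyJavaRdd c = b.cache();
        // The wrappers are distinct objects.
        System.out.println("a == b: " + (a == b));
        // All three wrappers point at one inner instance.
        System.out.println("same inner: " + (a.rdd() == b.rdd() && b.rdd() == c.rdd()));
    }
}
```

Applied to the question's code, this suggests that a, b, and c are three different Java objects that share one underlying RDD, so there is only one thing to cache.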

Answer 2

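Caching doesn't change the state of the RDD.

When a transformation occurs, caching computes and materializes the RDD in memory while keeping track of its lineage (dependencies). There are many levels of persistence.

Since caching remembers an RDD's lineage, Spark can recompute lost partitions in the event of a node failure. Lastly, a cached RDD lives within the context of the running application, and once the application terminates, cached RDDs are deleted as well.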
