Question

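I am building a generic function that receives an RDD and performs some calculations on it. Since I run more than one calculation on the input RDD, I would like to cache it. For example: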

public JavaRDD<String> foo(JavaRDD<String> r) {
    r.cache();
    JavaRDD t1 = r... //Some calculations
    JavaRDD t2 = r... //Other calculations
    return t1.union(t2);
}

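My question is: since r is given to me, it may or may not already be cached. If it is already cached and I call cache on it again, will Spark create a new layer of caching, meaning that while t1 and t2 are computed there will be two instances of r in the cache? Or is Spark aware that r is already cached, so that it ignores the second call?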

Answer 1

Nothing. If you call cache on an already cached RDD, nothing happens; the RDD will be cached (once). Like many other transformations, caching is lazy:

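When you call cache, the RDD's storageLevel is set to MEMORY_ONLY.
When you call cache again, it is set to the same value (no change).
Upon evaluation, when the underlying RDD is materialized, Spark checks the RDD's storageLevel and, if it requires caching, caches it.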

So you are safe.
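
The steps above can be illustrated with a toy model. To be clear, this is not Spark's implementation: ToyRDD, storage_level, compute_calls and collect's internals are hypothetical names made up for this sketch, and the storage level is a plain string. It only shows why a flag that is set lazily makes a second cache call harmless.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT Spark code)."""

    def __init__(self, compute):
        self._compute = compute        # function that produces the data
        self.storage_level = "NONE"    # no caching requested yet
        self._cached = None            # materialized data, if any
        self.compute_calls = 0         # how many times the data was computed

    def cache(self):
        # Only records the requested level; nothing is computed here.
        self.storage_level = "MEMORY_ONLY"
        return self

    def collect(self):
        # Evaluation: reuse the cached copy if one exists.
        if self._cached is not None:
            return self._cached
        self.compute_calls += 1
        data = self._compute()
        # The level is checked only now, when the data is materialized.
        if self.storage_level != "NONE":
            self._cached = data
        return data


rdd = ToyRDD(lambda: [1, 2, 3])
rdd.cache()
rdd.cache()              # sets the same level again: no change
first = rdd.collect()    # computes and stores the single cached copy
second = rdd.collect()   # served from that same copy
print(rdd.storage_level, rdd.compute_calls)
```

In this sketch the data is computed once and both collect calls return the same object, however many times cache was called beforehand.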

Answer 2

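I just tested this on my cluster. Zohar is right: nothing happens, and the RDD is cached only once. I think the reason is that every RDD has an id internally, and Spark uses that id to mark whether the RDD has already been cached. So caching the same RDD multiple times does nothing.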

Below is my code:

Update [code added as requested]

### cache and count; the storage info then shows up in the web UI

raw_file = sc.wholeTextFiles('hdfs://10.21.208.21:8020/user/mercury/names', minPartitions=40)\
                 .setName("raw_file")\
                 .cache()
raw_file.count()

### cache and count again, then check the web UI: nothing changes

raw_file.cache()
raw_file.count()

### change the RDD's name, then cache and count again to see whether a new RDD is cached under the new name:
### still nothing changes, so I think Spark may be using the RDD id as the marker; to be sure we would need to
### read the documentation, or even the source code, in detail

raw_file.setName("raw_file_2")
raw_file.cache().count()

What happens if I cache the same RDD twice in Spark?