Question

Suppose we have the following code.

x = sc.textFile(...)
y = x.map(...)
z = x.map(...)

Is caching x essential here? If x is not cached, won't Spark read the input file twice?

Answer 1

This code will not necessarily make Spark read the input twice.

Listing all the possible scenarios:

Example 1: File is not read even once

x = sc.textFile(...)    #creation of RDD
y = x.map(...)    #Transformation of RDD
z = x.map(...)    #Transformation of RDD

In this case nothing is executed at all, because there are only transformations and no action.

Example 2: File read once

x = sc.textFile(...)    #creation of RDD
y = x.map(...)    #Transformation of RDD
z = x.map(...)    #Transformation of RDD
print(y.count())    #Action of RDD

The file is read only once, when the action on y runs its map.

Example 3: File read twice

x = sc.textFile(...)    #creation of RDD
y = x.map(...)    #Transformation of RDD
z = x.map(...)    #Transformation of RDD
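print(y.count())    #Action of RDD
print(z.count())    #Action of RDD

The input file is now read twice, because each of the two actions runs its own transformation starting from the source.

Example 4: File read once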

x = sc.textFile(...)    #creation of RDD
y = x.map(...)    #Transformation of RDD
z = y.map(...)    #Transformation of RDD
print(z.count())    #Action of RDD

Example 5: File read twice

x = sc.textFile(...)    #creation of RDD
y = x.map(...)    #Transformation of RDD
z = y.map(...)    #Transformation of RDD
print(y.count())    #Action of RDD
print(z.count())    #Action of RDD

Since actions are now called on two different RDDs, the file is read twice.
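The read counts claimed in Examples 1 to 5 can be checked with a toy model. This is only a sketch in plain Python, not Spark: `ToyRDD`, `text_file` and `reads_for` are made-up names, and the model assumes that every action re-runs the full lineage from the source when nothing is cached.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it stores a recipe (lineage), not data."""

    source_reads = 0  # how many times the pretend input file was scanned

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces the rows

    @classmethod
    def text_file(cls, lines):
        def read():
            cls.source_reads += 1  # every evaluation rescans the source
            return list(lines)
        return cls(read)

    def map(self, fn):
        # Transformation: only records the step, nothing runs yet.
        return ToyRDD(lambda: [fn(row) for row in self._compute()])

    def count(self):
        # Action: runs the whole lineage, starting at the source.
        return len(self._compute())


def reads_for(scenario):
    """Run one scenario against a fresh source and return the scan count."""
    ToyRDD.source_reads = 0
    scenario(ToyRDD.text_file(["a", "b", "c"]))
    return ToyRDD.source_reads


def example_1(x):  # transformations only, no action
    x.map(str.upper)
    x.map(len)


def example_2(x):  # one action on y
    y = x.map(str.upper)
    x.map(len)
    y.count()


def example_3(x):  # two siblings of x, two actions
    y = x.map(str.upper)
    z = x.map(len)
    y.count()
    z.count()


def example_4(x):  # chain x -> y -> z, one action
    z = x.map(str.upper).map(len)
    z.count()


def example_5(x):  # chain x -> y -> z, two actions
    y = x.map(str.upper)
    z = y.map(len)
    y.count()
    z.count()


scenarios = (example_1, example_2, example_3, example_4, example_5)
print([reads_for(s) for s in scenarios])  # [0, 1, 2, 1, 2]
```

In this model the number of source scans equals the number of actions, regardless of how the RDDs are chained.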

Example 6: File read once

x = sc.textFile(...)    #creation of RDD
y = x.map(...).cache()    #Transformation of RDD
z = y.map(...)    #Transformation of RDD
print(y.count())    #Action of RDD
print(z.count())    #Action of RDD

Two different actions are still used, but y is computed only once and stored in memory. The second action then runs on the cached RDD.
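Example 6 can be mimicked in plain Python as well. Again this is a hedged sketch and not Spark itself: `ToyRDD` and `read_file` are hypothetical names, and `cache()` here only sets a flag so that the first action keeps the computed rows, which is the lazy behaviour the example relies on.

```python
class ToyRDD:
    """Toy RDD with an optional in-memory cache (illustration only)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function that produces the rows
        self._cached = False     # set by cache(), nothing is stored yet
        self._data = None        # filled by the first action if cached

    def map(self, fn):
        # Transformation: builds on the parent's rows, but only when asked.
        return ToyRDD(lambda: [fn(row) for row in self._rows()])

    def cache(self):
        # Lazy marker: the data is kept only once an action computes it.
        self._cached = True
        return self

    def _rows(self):
        if self._data is not None:
            return self._data        # served from the cache, no recomputation
        rows = self._compute()
        if self._cached:
            self._data = rows
        return rows

    def count(self):
        # Action: forces evaluation.
        return len(self._rows())


reads = []  # one entry per scan of the pretend input file


def read_file():
    reads.append("scan")
    return ["a", "b", "c"]


x = ToyRDD(read_file)
y = x.map(str.upper).cache()
z = y.map(len)
print(len(reads))  # 0: cache() by itself triggers nothing

y.count()          # first action: reads the file and fills y's cache
z.count()          # second action: starts from the cached y
print(len(reads))  # 1
```

Removing the `.cache()` call in this sketch brings the count back to 2, which is Example 5.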

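Edit: Additional information

So the question arises: what should be cached and what should not?
Ans: RDDs that you will use again and again should be cached.

Example 7: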

x = sc.textFile(...)    #creation of RDD
y = x.map(...)    #Transformation of RDD
z = x.map(...)    #Transformation of RDD

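In this case x is used again and again, so caching x is advisable: x then does not have to be read from the source each time. If you are working with large amounts of data, this will save you a lot of time.

Now suppose you start caching every RDD, in memory or on disk, with or without serialization. If Spark runs short of memory while executing a task, it starts removing old RDDs using an LRU (least recently used) policy. Whenever a removed RDD is used again, Spark re-executes all the steps from the source up to that RDD's transformation.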
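The eviction behaviour described above can be sketched with a small LRU store. This is a toy under stated assumptions, not Spark's storage layer: `LruStore`, its `capacity` (counted in whole datasets) and the `recompute` callbacks are all hypothetical, and real Spark tracks cached data per partition rather than per RDD.

```python
from collections import OrderedDict


class LruStore:
    """Toy in-memory store that keeps at most `capacity` cached datasets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last
        self.recomputations = []     # names rebuilt because of a cache miss

    def get(self, name, recompute):
        if name in self._items:
            self._items.move_to_end(name)    # mark as most recently used
            return self._items[name]
        self.recomputations.append(name)     # miss: rebuild from the lineage
        value = recompute()
        self._items[name] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
        return value


store = LruStore(capacity=2)
store.get("a", lambda: [1, 2, 3])  # miss
store.get("b", lambda: [4, 5])     # miss
store.get("a", lambda: [1, 2, 3])  # hit, "a" becomes the most recent entry
store.get("c", lambda: [6])        # miss, store is full so "b" is evicted
store.get("b", lambda: [4, 5])     # miss again: "b" has to be recomputed

print(store.recomputations)  # ['a', 'b', 'c', 'b']
```

The second recomputation of "b" is the cost of caching more than fits in memory: the evicted dataset is rebuilt from scratch the next time it is needed.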

Does an RDD need to be cached if it is used multiple times?

1 answer: