Question

I am using Spark 1.1.0 and trying to load a graph into GraphX. A small part of my code looks like this:

val distinct = context.union(r1, r2).distinct;
distinct.cache()

val zipped = distinct.zipWithUniqueId
zipped.cache
distinct.unpersist(false)

When I run it on the cluster, the first stage executed is:

distinct at Test.scala:72

But after this operation completes, I cannot see an entry in the "Storage" tab of the Spark UI. The next stage is:

zipWithUniqueId at Test.scala:78

But after that, it starts this again:

distinct at Test.scala:72

Shouldn't this result have been cached? And is cache useful at all if I only use the RDD once?

Edit:

I forgot to mention that I get fetch failures at zipWithUniqueId at Test.scala:78.

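Possible solution for the fetch problem

A possible solution is described here; this may be a bug in Spark version 1.1.0.

A possible solution from Andrew Ash on the spark-user mailing list: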

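It seems there are currently three things causing FetchFailures in 1.1:

1) Long GCs on an executor (longer than spark.core.connection.ack.wait.timeout, which defaults to 60 seconds)

2) Too many open files (hitting the kernel limit set by ulimit -n)

3) Some undetermined issue that is being tracked on that ticket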

Source
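As a rough illustration of cause 1 above, the ack timeout can be raised on the SparkConf before the context is created. This is only a sketch: the object name, the app name, and the value 600 are assumptions made for illustration, and the property name is the Spark 1.1-era one quoted above. Cause 2 is an operating-system limit and is not something SparkConf controls.

```scala
import org.apache.spark.SparkConf

object FetchFailureConfSketch {
  // Hypothetical helper: builds a SparkConf with a longer ack timeout.
  // "600" (seconds) is an illustrative value, not a recommendation.
  def conf(): SparkConf =
    new SparkConf()
      .setAppName("graph-load-sketch") // assumed application name
      .set("spark.core.connection.ack.wait.timeout", "600")
}
```

A longer timeout only hides long GC pauses; if executors really do pause for more than a minute, their memory settings are probably worth a look as well.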

Answer 1

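cache is applied the first time the RDD is evaluated. This means that, to be effective, cache has to be called before the action that materializes an RDD you will use more than once. Given that cache takes effect on RDD evaluation, if you have a linear RDD lineage that is executed only once, caching will only take up memory without providing any advantage.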

So, if your pipeline is:

val distinct = context.union(r1, r2).distinct;
val zipped = distinct.zipWithUniqueId
zipped.cache

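Using cache between distinct and zipped will be of no use unless you need to access the distinct data again, which does not seem to be the case given that you unpersist it right afterwards.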

In short, only use .cache if the evaluated RDD will be used more than once (e.g. iterative algorithms, lookups, ...).
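To make the "used more than once" case concrete, here is a minimal sketch in which the zipped RDD feeds two actions, so caching it saves recomputing the union/distinct/zipWithUniqueId lineage. The object name, the sample data, and the local master are assumptions made for illustration; it assumes spark-core is on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheReuseSketch {
  // Returns (number of vertices, number of distinct ids assigned to them).
  def run(sc: SparkContext): (Long, Long) = {
    // Assumed sample data standing in for r1 and r2 from the question.
    val r1 = sc.parallelize(Seq(1L, 2L, 3L))
    val r2 = sc.parallelize(Seq(3L, 4L, 5L))

    // Cache only the RDD that is reused; the intermediate distinct is not cached.
    val zipped = sc.union(r1, r2).distinct().zipWithUniqueId().cache()

    val vertexCount = zipped.count()                  // first action: evaluates the lineage and fills the cache
    val distinctIds = zipped.map(_._2).distinct().count() // second action: can read the cached partitions

    zipped.unpersist(blocking = true)
    (vertexCount, distinctIds)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-reuse-sketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```

If zipped were consumed by a single action only, the cache call here would add memory pressure without saving any work.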

A caching example in the spark-shell:

val rdd = sc.makeRDD( 1 to 1000)
val cached = rdd.cache // at this point, nothing in the console

SparkUI persistence tab: no persisted RDDs

cached.count // at this point, you can see cached in the console
res0: Long = 1000

SparkUI persistence tab: cached RDD is available

val zipped = cached.zipWithUniqueId
val zipcache = zipped.cache // again nothing new on the UI
zipcache.first // first is an action and will trigger RDD evaluation

SparkUI showing the 2nd cached rdd

cached.unpersist(blocking=true) // force immediate unpersist

SparkUI not showing unpersisted RDD anymore
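The same observation can be made without the UI. The sketch below uses SparkContext.getRDDStorageInfo, a developer API that, as far as I know, only reports RDDs that actually hold cached partitions, to check whether an RDD is materialized before and after the first action. The object name and the local master are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StorageCheckSketch {
  // Returns (materialized before the action?, materialized after the action?).
  def run(sc: SparkContext): (Boolean, Boolean) = {
    val cached = sc.makeRDD(1 to 1000).cache() // only marks the RDD for caching

    // No action has run yet, so no partitions should be stored.
    val before = sc.getRDDStorageInfo.exists(_.id == cached.id)

    cached.count() // action: evaluates the RDD and stores its partitions

    val after = sc.getRDDStorageInfo.exists(_.id == cached.id)

    cached.unpersist(blocking = true) // drop the cached blocks again
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("storage-check-sketch").setMaster("local[2]"))
    try println(run(sc)) finally sc.stop()
  }
}
```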
