Question

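I want to log the number of rows in an RDD at an intermediate point between the first and last transformations. My code currently looks like this: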

val transformation1 = firstTransformation(inputdata).cache  // Is this cache recommended or can I remove it?
log("Transformation1 count: " + transformation1.count)
val transformation2 = secondTransformation(transformation1).cache
val finalX = transformation2.filter(row => row.contains("x"))
val finalY = transformation2.filter(row => row.contains("y"))

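My problem is that transformation1 is a huge RDD and takes up a lot of memory (it fits in memory, but causes memory problems later on). However, I know that because I run two different operations on transformation1 (.count() and secondTransformation()), the usual advice is to cache it.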

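This situation is probably very common, so what is the recommended way to handle it? Should you always cache an RDD before an intermediate count, or can I remove the .cache() on transformation1?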

Answer 1

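If you are running into memory problems, you should unpersist as early as possible; you can also persist to disk instead of memory.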

import org.apache.spark.storage.StorageLevel

val transformation1 = firstTransformation(inputdata).persist(StorageLevel.DISK_ONLY)  // kept on disk rather than in memory
log("Transformation1 count: " + transformation1.count)
val transformation2 = secondTransformation(transformation1).persist(StorageLevel.DISK_ONLY)
val finalX = transformation2.filter(row => row.contains("x"))
val finalY = transformation2.filter(row => row.contains("y"))
// All the actions are done
transformation1.unpersist()
transformation2.unpersist()

If you can unpersist before the memory problems occur, it is better to cache than to persist on disk.
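
To make the "unpersist early" idea concrete, here is a minimal sketch of one way it could look. It is not the asker's actual job: the bodies of firstTransformation and secondTransformation are hypothetical stand-ins, the RDDs are assumed to hold strings, and the extra count on transformation2 is an assumption added here so that transformation2 is fully materialized before transformation1 is released.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object IntermediateCount {
  // Hypothetical stand-ins for the transformations in the question.
  def firstTransformation(in: RDD[String]): RDD[String] = in.map(_.toLowerCase)
  def secondTransformation(in: RDD[String]): RDD[String] = in.filter(_.nonEmpty)

  // Returns (count of transformation1, rows containing "x", rows containing "y").
  def run(sc: SparkContext, lines: Seq[String]): (Long, Long, Long) = {
    val inputData = sc.parallelize(lines)

    val transformation1 = firstTransformation(inputData).cache()
    // First action: computes transformation1 and stores it in the cache.
    val count1 = transformation1.count()
    println("Transformation1 count: " + count1)

    val transformation2 = secondTransformation(transformation1).cache()
    // Assumption: force transformation2 to be computed now, while
    // transformation1 is still cached, so it is not recomputed later.
    transformation2.count()

    // Nothing downstream reads transformation1 any more, so free its memory.
    transformation1.unpersist()

    val xCount = transformation2.filter(row => row.contains("x")).count()
    val yCount = transformation2.filter(row => row.contains("y")).count()

    // All the actions are done.
    transformation2.unpersist()
    (count1, xCount, yCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("intermediate-count")
    val sc = new SparkContext(conf)
    try println(run(sc, Seq("X1", "y2", "", "xy")))
    finally sc.stop()
  }
}
```

The trade-off in this sketch is one additional pass over transformation2 in exchange for holding the large transformation1 in memory only as long as it is actually needed. If even that is too much memory, the DISK_ONLY variant above remains the fallback.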
