Is it good practice to persist after an action in Spark?

Date: 2018-03-22 18:13:32

Tags: scala apache-spark

Given this example:

    val someRDD = firstRDD.flatMap{ case(x,y) => SomeFunc(y)}
    val oneRDD = someRDD.reduceByKey(_+_)
    oneRDD.saveAsNewAPIHadoopFile("dir/to/write/to", classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])

Which is better?

    val someRDD = firstRDD.flatMap{ case(x,y) => SomeFunc(y)}.persist(storage.StorageLevel.MEMORY_AND_DISK_SER)
    val oneRDD = someRDD.reduceByKey(_+_)
    oneRDD.saveAsNewAPIHadoopFile("dir/to/write/to", classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])

OR

    val someRDD = firstRDD.flatMap{ case(x,y) => SomeFunc(y)}.persist(storage.StorageLevel.MEMORY_AND_DISK_SER)
    val oneRDD = someRDD.reduceByKey(_+_).persist(storage.StorageLevel.MEMORY_AND_DISK_SER)
    oneRDD.saveAsNewAPIHadoopFile("dir/to/write/to", classOf[Text], classOf[Text], classOf[TextOutputFormat[Text, Text]])

Or something else?

I have found that persisting is a good idea when you perform multiple actions on the same RDD.

An example is:

    val newRDD = context.parallelize(0 until numMappers, numPartitions).persist(storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because two follow-on actions are performed on it
    newRDD.count()                  // same RDD
    newRDD.saveAsNewAPIHadoopFile() // same RDD
    // ...other actions, etc.

Here there is only one RDD and two actions. Should I persist in all of the cases above?

1 Answer:

Answer 0 (score: 1):

From the Spark documentation:

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. **We still recommend users call persist on the resulting RDD if they plan to reuse it.**

(I added bold around the above statement)

Note that chaining transformations is fine. The performance problem would occur when reusing an RDD that has not been persisted.
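
A minimal sketch of that guideline, assuming a live SparkContext `sc` and placeholder input data; `saveAsTextFile` stands in for the original `saveAsNewAPIHadoopFile` call just to keep the example self-contained. Persist only the RDD that more than one action reuses, and unpersist it when done:

    import org.apache.spark.storage.StorageLevel

    // Hypothetical input: (key, value) pairs built from a local collection.
    val firstRDD = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Chained transformations feeding the pipeline: no persist needed for a single pass.
    val oneRDD = firstRDD
      .flatMap { case (k, v) => Seq((k, v * 2)) }   // stands in for SomeFunc
      .reduceByKey(_ + _)

    // This RDD is reused by two actions, so persisting it avoids recomputing the lineage.
    oneRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(oneRDD.count())                  // first action: triggers computation and caching
    oneRDD.saveAsTextFile("dir/to/write/to") // second action: reuses the cached data

    oneRDD.unpersist()                       // release the cached blocks when finished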