Hi, please help me understand the Spark DAG model. According to the official Spark documentation, all transformations in Spark are lazy, and by default each transformed RDD may be recomputed each time you run an action on it. So I wrote a small program, as follows:
scala> val lines = sc.textFile("C:\\Spark\\README.md")
lines: org.apache.spark.rdd.RDD[String] = C:\Spark\README.md MapPartitionsRDD[1] at textFile at <console>:24
scala> val breakLInes = lines.flatMap(line=>line.split(" "))
breakLInes: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:26
scala> val createTuple = breakLInes.map(line=>(line,1))
createTuple: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:28
scala> val wordCount = createTuple.reduceByKey(_+_)
wordCount: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:30
scala> wordCount.first
res0: (String, Int) = (package,1)
Running it again:
scala> wordCount.first
res1: (String, Int) = (package,1)
Now, moving on to the Spark UI. Below is the DAG visualization for the second action:
[DAG visualization screenshot: Stage 2 is shown as skipped]
Now the question: the documentation says that by default, each transformed RDD may be recomputed each time you run an action on it. So why is Stage 2 skipped? It should have been computed again, since no caching was done.
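To make the question concrete, here is a small sketch of what I expected would be required to avoid recomputation. This is hypothetical code I did not run in the session above; toDebugString and cache() are standard RDD methods:

// Inspect the lineage of wordCount; the shuffle boundary shows up
// as a ShuffledRDD sitting on top of the MapPartitionsRDDs.
println(wordCount.toDebugString)

// My understanding was that only an explicitly cached RDD
// skips recomputation on a second action:
val cached = wordCount.cache()
cached.first   // first action: computes the RDD and populates the cache
cached.first   // second action: expected to reuse the cached partitions

Since I never called cache() or persist() in my actual session, I expected both stages to run again on the second wordCount.first, yet the UI marks Stage 2 as skipped.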