In my Spark application I see the same task being executed as part of multiple stages, even though the corresponding statements are defined only once in the code. On top of that, the same task takes a different amount of time in different stages. I understand that an RDD's lineage is used to recompute it if partitions are lost, but how can I tell whether that is what is happening here, given that I see the same behaviour on every run of this application? Could someone explain what is going on, and under which conditions a task can be scheduled into more than one stage? (A small sketch of how I think the plan/lineage can be inspected follows the code below.)
The code looks very much like the following:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val events = getEventsDF()
events.cache()
metricCounter.inc("scec", events.count())   // action #1 – also materializes the cache

val scEvents = events.filter(_.totalChunks == 1)
  .repartition(NUM_PARTITIONS, col("eventId"))
val sortedEvents = events.filter(e => e.totalChunks > 1 && e.totalChunks <= maxNumberOfChunks)
  .map(PartitionUtil.createKeyValueTuple)
  .rdd
  .repartitionAndSortWithinPartitions(new EventDataPartitioner(NUM_PARTITIONS))
val largeEvents = events.filter(_.totalChunks > maxNumberOfChunks).count()   // action #2

val mcEvents = sortedEvents.mapPartitionsWithIndex[CFEventLog](
  (index, iter) => doSomething())   // doSomething() stands in for logic returning an Iterator[CFEventLog]
val mcEventsDF = session.sqlContext.createDataset[CFEventLog](mcEvents)
metricCounter.inc("mcec", mcEventsDF.count())   // action #3
val currentDf = scEvents.unionByName(mcEventsDF)
val distinctDateHour = currentDf.select(col("eventDate"), col("eventHour"))
  .distinct
  .collect   // action #4
val prevEventsDF = getAnotherDF(distinctDateHour)
val finalDf = currentDf.unionByName(prevEventsDF).dropDuplicates(Seq("eventId"))
finalDf
  .write.mode(SaveMode.Overwrite)
  .partitionBy("event_date", "event_hour")
  .saveAsTable("table")   // action #5
val finalEventsCount = finalDf.count()   // action #6
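To try to answer the "how can I determine whether recomputation is happening" part myself, this is roughly what I think can be inspected (just a sketch reusing the names from the code above; as far as I know, explain() and toDebugString only show the planned lineage, they don't tell me whether a stage was actually recomputed):

println(events.storageLevel)         // something other than StorageLevel.NONE once events.cache() has been called
events.explain(true)                 // physical plan should show InMemoryRelation / InMemoryTableScan when the cache is used
println(sortedEvents.toDebugString)  // RDD lineage behind the repartitionAndSortWithinPartitions output
finalDf.explain()                    // the plan behind both the saveAsTable and the final count()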
Does every count() action cause the RDD transformations leading up to it to be executed again before the action runs?
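For context, my current mental model is the following, written as a minimal standalone sketch (this is not my application code, just how I understand actions and cache() on a Dataset):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("count-demo").getOrCreate()
import spark.implicits._

val base = spark.range(0, 1000000).map(_ * 2)   // an arbitrary transformation

// Uncached: I expect each of these actions to run the map() again in its own job.
val a = base.count()
val b = base.filter(_ % 4 == 0).count()

// Cached: the second action should read the materialized partitions instead of recomputing them.
base.cache()
base.count()                                    // materializes the cache
val c = base.filter(_ % 4 == 0).count()

Is that understanding correct, and is it what causes the same task to show up in several stages of my job?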
Thanks, Devj