Why does Spark execute each task multiple times?

Time: 2019-07-01 12:19:47

Tags: apache-spark apache-spark-sql

[Screenshot: Spark application stages]

In my Spark application I see the same task being executed in multiple stages, even though the corresponding statements are defined only once in the code. Moreover, the same task takes a different amount of time to run in different stages. I understand that the lineage is used to recompute an RDD when it is lost, but how can I tell whether that is what is happening here, given that I see the same behaviour on every run of this application? Can someone explain what is going on, and under what conditions a task can be scheduled into multiple stages?
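For illustration, here is a minimal sketch of what I think is happening; the SparkSession setup and the toy events DataFrame below are made up for the example and are not from my application. My understanding is that every action starts its own job, and each job re-runs the upstream stages from the lineage unless the data has been cached, which would explain the same tasks showing up in several stages:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical local session, for illustration only.
val spark = SparkSession.builder()
    .appName("recompute-sketch")
    .master("local[*]")
    .getOrCreate()

// Toy stand-in for the real events DataFrame.
val events = spark.range(0, 1000000).toDF("eventId")
    .withColumn("totalChunks", col("eventId") % 10)

// Without cache(): each action below starts its own job, and each job
// re-executes the whole lineage, so the "same" tasks reappear in the
// stages of later jobs in the Spark UI.
events.count()
events.filter(col("totalChunks") === 1).count()

// With cache(): the first action materialises the blocks, and the
// second one reads them back instead of recomputing the lineage.
events.cache()
events.count()
events.filter(col("totalChunks") === 1).count()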

My actual code is very similar to the following:

val events = getEventsDF()
events.cache()

metricCounter.inc("scec", events.count())

val scEvents = events.filter(_.totalChunks == 1)
    .repartition(NUM_PARTITIONS, lit(col("eventId")))

val sortedEvents = events.filter(e => e.totalChunks > 1 && e.totalChunks <= maxNumberOfChunks)
    .map(PartitionUtil.createKeyValueTuple)
    .rdd
    .repartitionAndSortWithinPartitions(new EventDataPartitioner(NUM_PARTITIONS))

val largeEvents = events.filter(_.totalChunks > maxNumberOfChunks).count()

val mcEvents = sortedEvents.mapPartitionsWithIndex[CFEventLog](
    (index: Int, iter: Iterator[(_, _)]) => doSomething())

val mcEventsDF = session.sqlContext.createDataset[CFEventLog](mcEvents)

metricCounter.inc("mcec", mcEventsDF.count())

val currentDf = scEvents.unionByName(mcEventsDF)

val distinctDateHour = currentDf.select(col("eventDate"), col("eventHour"))
    .distinct
    .collect

val prevEventsDF = getAnotherDF(distinctDateHour)

val finalDf = currentDf.unionByName(prevEventsDF).dropDuplicates(Seq("eventId"))

finalDf
      .write.mode(SaveMode.Overwrite)
      .partitionBy("event_date", "event_hour")
      .saveAsTable("table")

val finalEventsCount = finalDf.count()

Does each count() action cause the RDD transformations to be re-executed before that action runs?
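In case the answer is yes: both the saveAsTable write and the final count() would then re-run the whole DAG, since finalDf is never persisted. Below is only an untested sketch of the variant I am considering; the persist()/unpersist() calls are the only lines added to my code above:

import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

// Persist finalDf so the write and the count() share one computation
// instead of each re-running the whole DAG.
finalDf.persist(StorageLevel.MEMORY_AND_DISK)

finalDf
      .write.mode(SaveMode.Overwrite)
      .partitionBy("event_date", "event_hour")
      .saveAsTable("table")

val finalEventsCount = finalDf.count()   // served from the persisted blocks

finalDf.unpersist()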

Thanks, Devj

0 Answers:

There are no answers yet.