spark数据帧操作运行了很长时间,看似完成但没有返回结果

时间:2018-05-30 21:06:05

标签: scala apache-spark apache-spark-sql

我正在创建一个火花作业,有问题的代码部分就是:

var withBansDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// schema is previously and correctly defined

for (id <- championIds) {

  var bansDf = dataframe.
    where(array_contains($"team1_bans", id.getInt(0)) || array_contains($"team2_bans", id.getInt(0))).
    groupBy($"patch_game_version").count().
    withColumn("champion_id", typedLit(id.getInt(0))).
    withColumnRenamed("count","banned_count")

  var banrateDf = bansDf.join(gamesByPatchDf, Seq("patch_game_version")).withColumn("ban_rate", $"banned_count"/$"games_count")

  var championBanRatesDf = inProgress.join(broadcast(banrateDf), Seq("champion_id", "patch_game_version"))

  withBansDf = withBansDf.union(championBanRatesDf)

}

这样就完成了,但是当我尝试对withBansDf数据框(例如showcount)执行任何操作时,该过程会运行很长时间,并且spark UI表示作业完成过了一会儿(10-15分钟)。但是,没有结果。不返回计数,或者根本不显示数据帧。

我认为这是因为我的工作效率低下和/或遇到了一些内存问题。如果我在for循环中为少数ids运行相同的代码块,它会按预期工作。

以下是最终的控制台输出:

18/05/30 16:54:57 INFO ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 200 blocks
    18/05/30 16:54:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
    18/05/30 16:54:57 INFO Executor: Finished task 199.0 in stage 3792.0 (TID 341227). 3597 bytes result sent to driver
    18/05/30 16:54:57 INFO TaskSetManager: Finished task 199.0 in stage 3792.0 (TID 341227) in 10 ms on localhost (executor driver) (200/200)
    18/05/30 16:54:57 INFO TaskSchedulerImpl: Removed TaskSet 3792.0, whose tasks have all completed, from pool 
    18/05/30 16:54:57 INFO DAGScheduler: ShuffleMapStage 3792 (run at ThreadPoolExecutor.java:1149) finished in 88.910 s
    18/05/30 16:54:57 INFO DAGScheduler: looking for newly runnable stages
    18/05/30 16:54:57 INFO DAGScheduler: running: Set()
    18/05/30 16:54:57 INFO DAGScheduler: waiting: Set(ResultStage 3793)
    18/05/30 16:54:57 INFO DAGScheduler: failed: Set()
    18/05/30 16:54:57 INFO DAGScheduler: Submitting ResultStage 3793 (MapPartitionsRDD[13099] at run at ThreadPoolExecutor.java:1149), which has no missing parents
    18/05/30 16:54:57 INFO MemoryStore: Block broadcast_3795 stored as values in memory (estimated size 34.9 KB, free 366.2 MB)
    18/05/30 16:54:57 INFO MemoryStore: Block broadcast_3795_piece0 stored as bytes in memory (estimated size 15.9 KB, free 366.2 MB)
    18/05/30 16:54:57 INFO BlockManagerInfo: Added broadcast_3795_piece0 in memory on 192.168.99.210:53425 (size: 15.9 KB, free: 366.3 MB)
    18/05/30 16:54:57 INFO SparkContext: Created broadcast 3795 from broadcast at DAGScheduler.scala:1006
    18/05/30 16:54:57 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 3793 (MapPartitionsRDD[13099] at run at ThreadPoolExecutor.java:1149) (first 15 tasks are for partitions Vector(0, 1))
    18/05/30 16:54:57 INFO TaskSchedulerImpl: Adding task set 3793.0 with 2 tasks
    18/05/30 16:54:57 INFO TaskSetManager: Starting task 1.0 in stage 3793.0 (TID 341228, localhost, executor driver, partition 1, PROCESS_LOCAL, 4726 bytes)
    18/05/30 16:54:57 INFO Executor: Running task 1.0 in stage 3793.0 (TID 341228)
    18/05/30 16:54:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 200 blocks
    18/05/30 16:54:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
    18/05/30 16:54:57 INFO Executor: Finished task 1.0 in stage 3793.0 (TID 341228). 3826 bytes result sent to driver
    18/05/30 16:54:57 INFO TaskSetManager: Starting task 0.0 in stage 3793.0 (TID 341229, localhost, executor driver, partition 0, ANY, 4726 bytes)
    18/05/30 16:54:57 INFO Executor: Running task 0.0 in stage 3793.0 (TID 341229)
    18/05/30 16:54:57 INFO TaskSetManager: Finished task 1.0 in stage 3793.0 (TID 341228) in 6 ms on localhost (executor driver) (1/2)
    18/05/30 16:54:57 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 200 blocks
    18/05/30 16:54:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
    18/05/30 16:54:57 INFO Executor: Finished task 0.0 in stage 3793.0 (TID 341229). 3847 bytes result sent to driver
    18/05/30 16:54:57 INFO TaskSetManager: Finished task 0.0 in stage 3793.0 (TID 341229) in 6 ms on localhost (executor driver) (2/2)
    18/05/30 16:54:57 INFO TaskSchedulerImpl: Removed TaskSet 3793.0, whose tasks have all completed, from pool 
    18/05/30 16:54:57 INFO DAGScheduler: ResultStage 3793 (run at ThreadPoolExecutor.java:1149) finished in 0.013 s
    18/05/30 16:54:57 INFO DAGScheduler: Job 1266 finished: run at ThreadPoolExecutor.java:1149, took 88.934566 s

我正在寻找失败的确切原因和/或改善工作的选项,以便它不会失败。

0 个答案:

没有答案