Why does Spark standalone report "ExecutorLostFailure (executor driver lost)" with cogroup?

Asked: 2015-10-06 14:51:49

Tags: scala apache-spark

I am running Spark in standalone mode, using the cogroup function on two datasets (one 9 GB, the other 110 KB), as shown in the code below.

I have 128 GB of RAM and 24 cores. My configuration is:

set("spark.executor.memory","64g")
set("spark.driver.memory","64g")

IntelliJ VM options: -Xmx128G
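
For completeness, a minimal sketch of how these settings are applied when building the context (the app name and master URL here are placeholders, not my actual values):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and master URL; the memory settings match the ones above.
val conf = new SparkConf()
  .setAppName("cogroup-job")
  .setMaster("local[24]")
  .set("spark.executor.memory", "64g")
  .set("spark.driver.memory", "64g")
val sc = new SparkContext(conf)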

As the code shows, I have split the data into 1000 partitions. I have also tried 5000 and 10000, since countByKey is very expensive in my case.

From some other StackOverflow posts I learned about the spark.default.parallelism option. How should I tune my configuration? Do I need to add anything more to the IntelliJ VM options? Should I set spark.default.parallelism?
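
If spark.default.parallelism is the right knob, my understanding is that it goes on the same SparkConf and sets the default partition count for shuffle operations when no explicit count is passed. A sketch (the value 10000 just mirrors the largest partition count I have already tried):

// Assumed tuning: default number of partitions for shuffle operations
// (cogroup, reduceByKey, ...) when no explicit count is given.
conf.set("spark.default.parallelism", "10000")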

// Tab-separated input: key on column index 3, value on column index 1.
val emp     = sc.textFile("\\text1.txt", 1000).map { line => val s = line.split("\t"); (s(3), s(1)) }
val emp_new = sc.textFile("\\text2.txt", 1000).map { line => val s = line.split("\t"); (s(3), s(1)) }

// Group both RDDs by key, then emit every (e1, e2) pairing within a key.
val cog = emp.cogroup(emp_new)
val skk = cog.flatMap {
  case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
    for { e1 <- l1.toSeq; e2 <- l2.toSeq } yield ((e1, e2), 1)
}
// Counts occurrences of each (e1, e2) pair; returns a Map to the driver.
val com = skk.countByKey()
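
As I understand it, countByKey materializes the complete (pair -> count) map on the driver, which may be part of why it is so expensive here; an equivalent that keeps the counts distributed would look roughly like this (the output path is a placeholder):

// Same per-key counts as countByKey, but kept as a distributed RDD
// instead of a Map collected on the driver.
val comRdd = skk.reduceByKey(_ + _)
comRdd.saveAsTextFile("pair_counts")  // placeholder output path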

With 1000 and 5000 partitions, countByKey spills far too much to disk; with 10000 partitions I start to see some progress, and at least some tasks complete. But after a while I get errors like the following:

15/10/06 14:01:17 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 451457 ms exceeds timeout 120000 ms
15/10/06 14:01:17 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 451457 ms
15/10/06 14:01:17 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 2.0
15/10/06 14:01:17 WARN TaskSetManager: Lost task 109.0 in stage 2.0 (TID 20111, localhost): ExecutorLostFailure (executor driver lost)
15/10/06 14:01:17 ERROR TaskSetManager: Task 109 in stage 2.0 failed 1 times; aborting job
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 91), so marking it as still running
15/10/06 14:01:17 WARN TaskSetManager: Lost task 34.0 in stage 2.0 (TID 20036, localhost): ExecutorLostFailure (executor driver lost)
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 118), so marking it as still running
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 100), so marking it as still running
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 76), so marking it as still running

...

15/10/06 14:01:17 INFO TaskSchedulerImpl: Cancelling stage 2
15/10/06 14:01:17 INFO DAGScheduler: ShuffleMapStage 2 (countByKey at ngram.scala:39) failed in 1020,915 s
15/10/06 14:01:17 INFO DAGScheduler: Job 0 failed: countByKey at ngram.scala:39, took 3025,563964 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 109 in stage 2.0 failed 1 times, most recent failure: Lost task 109.0 in stage 2.0 (TID 20111, localhost): ExecutorLostFailure (executor driver lost)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)

0 Answers

There are no answers yet.