I am running Spark in standalone mode and using cogroup on two datasets (one 9 GB, the other 110 KB), as shown below.
I have 128 GB of RAM and 24 cores. My configuration is:
set("spark.executor.memory","64g")
set("spark.driver.memory","64g")
IntelliJ VM options: -Xmx128G
As you can see from the code, I have split the data into 1000 partitions. I have also tried 5000 and 10000, because countByKey is very expensive in my case.
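For context on why countByKey is so costly here: it builds the complete key-to-count map in driver memory, and cogrouping a 9 GB dataset can produce a huge number of distinct (e1, e2) keys. A minimal sketch of the semantics, using plain Scala collections in place of RDDs (this is an illustration, not Spark code):

```scala
// Plain-Scala sketch of what countByKey does semantically:
// the entire key -> count map is materialized in one place
// (in Spark, that place is the driver's heap).
val pairs = Seq((("a", "x"), 1), (("a", "x"), 1), (("b", "y"), 1))

// countByKey-like: group by key, count occurrences, keep the full map.
val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.size) }
// counts == Map(("a","x") -> 2, ("b","y") -> 1)
```

With billions of distinct pairs, that single map can exceed any reasonable driver heap, regardless of how many partitions the shuffle used.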
From some other StackOverflow posts I have seen the spark.default.parallelism option. How should I tune my configuration? Do I need to add anything more to the IntelliJ VM options? Should I use spark.default.parallelism?
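For reference, the settings above plus spark.default.parallelism could be combined into one SparkConf roughly like this. This is only a sketch: the master URL and the parallelism value of 48 (about 2× the 24 cores) are illustrative assumptions, not verified recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the settings from the question plus
// spark.default.parallelism. The master URL and the
// parallelism value (≈2x the 24 cores) are assumptions.
val conf = new SparkConf()
  .setMaster("local[24]") // assumed; use your standalone master URL if applicable
  .setAppName("cogroup-test")
  .set("spark.executor.memory", "64g")
  .set("spark.driver.memory", "64g")
  .set("spark.default.parallelism", "48") // illustrative value
val sc = new SparkContext(conf)
```

Note that spark.default.parallelism sets the default partition count for shuffles such as the one cogroup triggers; an explicit numPartitions argument to cogroup overrides it.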
val emp = sc.textFile("\\text1.txt", 1000).map { line =>
  val s = line.split("\t"); (s(3), s(1))
}
val emp_new = sc.textFile("\\text2.txt", 1000).map { line =>
  val s = line.split("\t"); (s(3), s(1))
}
val cog = emp.cogroup(emp_new)
val skk = cog.flatMap {
  case (key: String, (l1: Iterable[String], l2: Iterable[String])) =>
    for { e1 <- l1.toSeq; e2 <- l2.toSeq } yield ((e1, e2), 1)
}
val com = skk.countByKey()
With 1000 and 5000 partitions, countByKey spills far too much. With 10000 partitions I start to get some results, at least some tasks complete, but after a while I get the error shown below:
15/10/06 14:01:17 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 451457 ms exceeds timeout 120000 ms
15/10/06 14:01:17 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 451457 ms
15/10/06 14:01:17 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 2.0
15/10/06 14:01:17 WARN TaskSetManager: Lost task 109.0 in stage 2.0 (TID 20111, localhost): ExecutorLostFailure (executor driver lost)
15/10/06 14:01:17 ERROR TaskSetManager: Task 109 in stage 2.0 failed 1 times; aborting job
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 91), so marking it as still running
15/10/06 14:01:17 WARN TaskSetManager: Lost task 34.0 in stage 2.0 (TID 20036, localhost): ExecutorLostFailure (executor driver lost)
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 118), so marking it as still running
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 100), so marking it as still running
15/10/06 14:01:17 INFO DAGScheduler: Resubmitted ShuffleMapTask(2, 76), so marking it as still running
...
15/10/06 14:01:17 INFO TaskSchedulerImpl: Cancelling stage 2
15/10/06 14:01:17 INFO DAGScheduler: ShuffleMapStage 2 (countByKey at ngram.scala:39) failed in 1020,915 s
15/10/06 14:01:17 INFO DAGScheduler: Job 0 failed: countByKey at ngram.scala:39, took 3025,563964 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 109 in stage 2.0 failed 1 times, most recent failure: Lost task 109.0 in stage 2.0 (TID 20111, localhost): ExecutorLostFailure (executor driver lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)