I set up a Spark cluster with 20 nodes on EC2, put all the node IPs in conf/slaves on the master, and launched a job with SparkR using 50 slices. My nodes are dual-core with 4 GB of RAM. At the end of the job the results are collected into a CSV file that should contain roughly 15,000 rows (and 7 columns of floats). The job runs for a while (about 6000 s) until I get the following error on the master (this is not from the Spark master log, but from the terminal window in which I launched the Spark job):
16/03/21 22:39:31 INFO TaskSetManager: Finished task 27.0 in stage 0.0 (TID 27) in 5954810 ms on ip-xxx-yy-xx-zzz.somewhere.compute.internal (8/40)
16/03/21 22:39:38 INFO TaskSetManager: Finished task 12.0 in stage 0.0 (TID 12) in 5962190 ms on ip-xxx-xx-xx-xxx.somewhere.compute.internal (9/40)
Error in if (returnStatus != 0) { : argument is of length zero
Calls: <Anonymous> -> <Anonymous> -> .local -> callJMethod -> invokeJava
Execution halted
16/03/21 22:40:16 INFO SparkContext: Invoking stop() from shutdown hook
16/03/21 22:40:16 INFO SparkUI: Stopped Spark web UI at http://172.31.21.134:4040
16/03/21 22:40:16 INFO DAGScheduler: Job 0 failed: collect at NativeMethodAccessorImpl.java:-2, took 6001.135894 s
16/03/21 22:40:16 INFO DAGScheduler: ShuffleMapStage 0 (RDD at RRDD.scala:36) failed in 6000.500 s
16/03/21 22:40:16 ERROR RBackendHandler: collect on 16 failed
16/03/21 22:40:16 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@6c9d21b2)
16/03/21 22:40:16 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1458600016592,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
16/03/21 22:40:16 INFO SparkDeploySchedulerBackend: Shutting down all executors
I checked the worker logs and saw the following two lines at the end of the log file:
16/03/21 22:40:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/03/21 22:40:16 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
Then the log just stops (with no other errors or warnings before that). I don't see any hint in the log files as to what might have caused the crash; my only guess is that it is an out-of-memory error, because the job runs fine on a reduced input dataset. Am I missing something?
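For reference, the job follows roughly the pattern sketched below. This is a minimal sketch, not the actual script: the master URL, memory setting, and the dummy per-slice data are placeholders, and the `SparkR:::` accessor is assumed because the RDD API (where `numSlices` is specified) is private in SparkR 1.x.

    library(SparkR)

    # Placeholder setup: master URL and memory size are assumptions, not the
    # real job configuration. spark.executor.memory is a standard Spark key.
    sc <- sparkR.init(
      master     = "spark://<master-ip>:7077",
      appName    = "collect-to-csv",
      sparkEnvir = list(spark.executor.memory = "3g")
    )

    # The RDD API is private in SparkR 1.x, hence the ::: accessor; numSlices
    # is where the 50 comes from. Each "slice" here is a dummy one-row
    # data.frame with 7 numeric columns, standing in for the real work.
    chunks <- lapply(1:50, function(i) data.frame(matrix(runif(7), ncol = 7)))
    rdd    <- SparkR:::parallelize(sc, chunks, numSlices = 50)

    # collect() returns every partition to the driver, so the full ~15000 x 7
    # result has to fit in driver memory before it is written to disk.
    out <- do.call(rbind, collect(rdd))
    write.csv(out, "results.csv", row.names = FALSE)

    sparkR.stop()

The last step is why I suspect memory: collect() materialises all partitions on the driver before write.csv() runs.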