Executor heartbeat timed out using Spark on DataProc

Time: 2016-09-03 16:16:04

Tags: apache-spark apache-spark-ml google-cloud-dataproc

I am trying to fit an ml model in Spark (2.0.0) on a Google DataProc cluster. While fitting the model I get an Executor heartbeat timed out error. How can I resolve this?

Other answers indicate that this can be caused by (one of) the executors running out of memory. The solutions I have read are: set the right settings, repartition, cache, and get a bigger cluster. What can I do, preferably without setting up a larger cluster? (Make more/fewer partitions? Cache less? Adjust settings?)
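For reference, a minimal PySpark sketch of the repartition/cache knobs mentioned above; the DataFrame name, the input path, and the partition count are placeholders rather than values from the actual job:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("heartbeat-debug").getOrCreate()

    # Placeholder input; the real job fits an ml model on some DataFrame df.
    df = spark.read.parquet("gs://some-bucket/input")  # hypothetical path

    # More, smaller partitions: each task then holds less data in executor memory.
    df = df.repartition(400)  # 400 is an arbitrary example value

    # Cache less aggressively, or allow spilling to disk instead of keeping
    # everything pinned in executor memory.
    df.persist(StorageLevel.MEMORY_AND_DISK)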

My setup:

Spark 2.0.0 on a Google DataProc cluster: 1 master and 2 workers, all with the same specs: n1-highmem-8 -> 8 vCPUs, 52.0 GB memory, 500 GB disk

Settings (a sketch for checking the values that actually take effect follows the list):

spark\:spark.executor.cores=1
distcp\:mapreduce.map.java.opts=-Xmx2457m
spark\:spark.driver.maxResultSize=1920m
mapred\:mapreduce.map.java.opts=-Xmx2457m
yarn\:yarn.nodemanager.resource.memory-mb=6144
mapred\:mapreduce.reduce.memory.mb=6144
spark\:spark.yarn.executor.memoryOverhead=384
mapred\:mapreduce.map.cpu.vcores=1
distcp\:mapreduce.reduce.memory.mb=6144
mapred\:yarn.app.mapreduce.am.resource.mb=6144
mapred\:mapreduce.reduce.java.opts=-Xmx4915m
yarn\:yarn.scheduler.maximum-allocation-mb=6144
dataproc\:dataproc.scheduler.max-concurrent-jobs=11
dataproc\:dataproc.heartbeat.master.frequency.sec=30
mapred\:mapreduce.reduce.cpu.vcores=2
distcp\:mapreduce.reduce.java.opts=-Xmx4915m
distcp\:mapreduce.map.memory.mb=3072
spark\:spark.driver.memory=3840m
mapred\:mapreduce.map.memory.mb=3072
yarn\:yarn.scheduler.minimum-allocation-mb=512
mapred\:yarn.app.mapreduce.am.resource.cpu-vcores=2
spark\:spark.yarn.am.memoryOverhead=384
spark\:spark.executor.memory=2688m
spark\:spark.yarn.am.memory=2688m
mapred\:yarn.app.mapreduce.am.command-opts=-Xmx4915m
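
These are cluster-level defaults set by Dataproc. One way to confirm which of the spark:* values actually take effect in a job is to read the runtime configuration back from PySpark; a small sketch (the property names are taken from the list above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("conf-check").getOrCreate()

    # Print the Spark properties in effect for this application; they should
    # mirror the spark:* entries above unless overridden at submit time.
    for key in ("spark.executor.memory",
                "spark.executor.cores",
                "spark.yarn.executor.memoryOverhead",
                "spark.driver.memory",
                "spark.driver.maxResultSize"):
        print(key, "=", spark.conf.get(key, "not set"))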

Full error:

Py4JJavaError: An error occurred while calling o4973.fit. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 151 in stage 16964.0 failed 4 times, most recent failure: Lost task 151.3 in stage 16964.0 (TID 779444, reco-test-w-0.c.datasetredouteasvendor.internal): ExecutorLostFailure (executor 14 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 175122 ms
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:372)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:371)
    at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
    at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1156)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1155)
    at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:91)
    at org.apache.spark.ml.feature.StringIndexer.fit(StringIndexer.scala:66)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)

1 Answer:

Answer 0 (score: 2)

Since this question has no answer, to summarize: the issue appears to have been related to spark.executor.memory being set too low, causing occasional out-of-memory errors on an executor.

The suggested fix is to first go back to the default Dataproc configuration, which tries to make full use of all cores and memory available on each instance. If the problem persists, adjust spark.executor.memory and spark.executor.cores to increase the amount of memory available per task (roughly spark.executor.memory / spark.executor.cores).
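As a rough illustration of that advice: with the settings in the question, each task gets about spark.executor.memory / spark.executor.cores = 2688m / 1, i.e. roughly 2.6 GB. A hedged sketch of requesting more memory per task when the job's SparkSession is created; the values are examples only, not tested against this cluster, and they must fit within YARN's maximum allocation:

    from pyspark.sql import SparkSession

    # Example only: a larger executor heap shared by fewer concurrent tasks.
    # Memory per concurrent task ~= spark.executor.memory / spark.executor.cores.
    spark = (
        SparkSession.builder
        .appName("stringindexer-fit")
        .config("spark.executor.memory", "10g")   # example value, subject to YARN limits
        .config("spark.executor.cores", "2")      # ~5 GB per concurrent task
        .getOrCreate()
    )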

Dennis gives further details on Spark memory configuration on Dataproc in the following answer:
Google Cloud Dataproc configuration issues