PySpark job failing on Google Cloud Dataproc

Asked: 2016-05-05 02:17:35

Tags: apache-spark pyspark google-cloud-dataproc

My job fails with the log below, which I don't fully understand. It appears to be caused by:

"YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 24.7 GB of 24 GB physical memory used."

But how do I increase the memory on Google Cloud Dataproc?

Log:

16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 332.0 in stage 0.0 (TID 332, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 335.0 in stage 0.0 (TID 335, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 329.0 in stage 0.0 (TID 329, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Traceback (most recent call last):
  File "/tmp/5d6059b8-f9f4-4be6-9005-76c29a27af17/fetch.py", line 127, in <module>
    main()
  File "/tmp/5d6059b8-f9f4-4be6-9005-76c29a27af17/fetch.py", line 121, in main
    d.saveAsTextFile('gs://ll_hang/decahose-hashtags/data-multi3')
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1506, in saveAsTextFile
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o50.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 191 in stage 0.0 failed 4 times, most recent failure: Lost task 191.3 in stage 0.0 (TID 483, cluster-4-w-40.c.ll-1167.internal): ExecutorLostFailure (executor 114 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 25.2 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1213)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1156)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1060)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1026)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1026)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:952)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:952)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:952)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:951)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1457)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1436)
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1436)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1436)
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:507)
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:46)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)

16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 280.1 in stage 0.0 (TID 475, cluster-4-w-3.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 283.1 in stage 0.0 (TID 474, cluster-4-w-67.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 0.0 (TID 10, cluster-4-w-95.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 9.0 in stage 0.0 (TID 9, cluster-4-w-95.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 184.1 in stage 0.0 (TID 463, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 81.0 in stage 0.0 (TID 81, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 85.0 in stage 0.0 (TID 85, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 84.0 in stage 0.0 (TID 84, cluster-4-w-60.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@27cb5c01,null)
16/05/05 01:12:42 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 438.1 in stage 0.0 (TID 442, cluster-4-w-23.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@71f24e3e,null)
16/05/05 01:12:42 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 114 idle
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 97.0 in stage 0.0 (TID 97, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 102.0 in stage 0.0 (TID 102, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@2ed7b1d,null)
16/05/05 01:12:42 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(0,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@1b339b4f,null)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 190.1 in stage 0.0 (TID 461, cluster-4-w-67.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 111.0 in stage 0.0 (TID 111, cluster-4-w-74.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 101.0 in stage 0.0 (TID 101, cluster-4-w-50.c.ll-1167.internal): TaskKilled (killed intentionally)
16/05/05 01:12:42 ERROR org.apache.spark.network.server.TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler or it has been stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:578)
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:170)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:104)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
16/05/05 01:12:42 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

1 Answer:

Answer 0 (score: 5)

In Dataproc, Spark is configured to pack one executor per half machine, and each executor then runs multiple tasks in parallel, depending on how many cores that half machine provides. For example, on an n1-standard-4 you would expect each executor to use 2 cores and therefore run two tasks at a time. Memory is split the same way, although some of it is also reserved for daemon services, some for YARN executor overhead, and so on.
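Before changing anything, it can help to see what Dataproc actually chose for this cluster. A minimal sketch, assuming the master node follows Dataproc's usual <cluster>-m naming (so cluster-4-m for the cluster in the log) and that the chosen defaults are written to /etc/spark/conf/spark-defaults.conf:

    # Peek at the executor sizing Dataproc configured on the master node (assumed paths/names)
    gcloud compute ssh cluster-4-m --command \
        "grep -E 'spark\.executor\.(cores|memory)' /etc/spark/conf/spark-defaults.conf"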

Given that layout, there are generally a few options for increasing the memory available to each task:

  1. You can decrease spark.executor.cores one at a time, down to a minimum of 1; since this leaves spark.executor.memory unchanged, each parallel task effectively gets a larger share of each executor's memory. For example, on an n1-standard-8 the default is spark.executor.cores=4 with roughly 12GB of executor memory, so each "task" gets to use ~3GB. With spark.executor.cores=3, executor memory stays at 12GB and each task now gets ~4GB. You could try going all the way down to spark.executor.cores=1 to see whether the approach works at all, then increase it again as long as the job still succeeds, to keep CPU utilization high. You can set this at job-submission time (a fuller submit sketch follows this list):

    gcloud dataproc jobs submit pyspark --properties spark.executor.cores=1 ...
    
  2. Alternatively, you can increase spark.executor.memory; just run gcloud dataproc clusters describe cluster-4 to look at your cluster's resources and you will see the current settings.

  3. If you don't want to waste cores, you may want to try a different machine type. For example, if you're currently using n1-standard-8, try n1-highmem-8 instead. Dataproc still gives half a machine to each executor, so each executor ends up with more memory. You can also use custom machine types to fine-tune the memory-to-CPU balance.
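Putting option 1 into a concrete command, plus the off-heap headroom the error message itself suggests boosting: a hedged sketch, where the script name fetch.py and the cluster name cluster-4 are taken from the log above, and the property values are illustrative starting points rather than tuned recommendations. spark.yarn.executor.memoryOverhead is specified in MB on this Spark version.

    # Fewer concurrent tasks per executor -> each task gets a larger share of executor memory
    gcloud dataproc jobs submit pyspark fetch.py \
        --cluster cluster-4 \
        --properties spark.executor.cores=1

    # Or keep the core count and raise the off-heap headroom the YARN kill message mentions
    gcloud dataproc jobs submit pyspark fetch.py \
        --cluster cluster-4 \
        --properties spark.yarn.executor.memoryOverhead=4096

Change one knob at a time and re-run, so it stays clear which setting actually stopped the YARN container kills.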