Spark Structured Streaming job exits silently

Date: 2019-05-21 02:58:03

Tags: apache-spark hdfs yarn hortonworks-data-platform spark-structured-streaming

I have a Spark Structured Streaming job that dies silently, with no explicit error message in the application logs. It runs fine for about 10 hours, then starts to emit some non-fatal error messages. It keeps producing results for roughly another day, and then the driver container dies silently.

The job runs on a 3-node HDP cluster managed by YARN in yarn-cluster mode. It ingests data from Kafka, does some computation, and then sends the output to Kafka and HDFS.
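For context, the job is roughly shaped like the sketch below (simplified, not the actual code; broker addresses, topic names, HDFS paths, and the transformation are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-kafka-and-hdfs").getOrCreate()

// Source: Kafka (placeholder broker and topic).
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()

// Stand-in for the actual computation.
val result = input.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Sink 1: back to Kafka, checkpointing on HDFS.
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "hdfs:///checkpoints/kafka-sink")
  .start()

// Sink 2: Parquet files on HDFS, with its own checkpoint directory.
result.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output")
  .option("checkpointLocation", "hdfs:///checkpoints/hdfs-sink")
  .start()

spark.streams.awaitAnyTermination()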

First I looked at the YARN application logs for the driver container and found the following error message:

19/05/19 21:02:08 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[10.8.0.247:50010,DS-6502520b-5b78-408b-b18d-a99df4fb76ab,DISK], DatanodeInfoWithStorage[10.8.0.145:50010,DS-d8133dc8-cfaa-406d-845d-c819186c1450,DISK]], original=[DatanodeInfoWithStorage[10.8.0.247:50010,DS-6502520b-5b78-408b-b18d-a99df4fb76ab,DISK], DatanodeInfoWithStorage[10.8.0.145:50010,DS-d8133dc8-cfaa-406d-845d-c819186c1450,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1059)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1122)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1280)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1005)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:512)

End of LogType:stderr
***********************************************************************

This is the last message from the driver.

This looks scary, but the job racked up 36,628 errors of this kind in a single day while still producing results, so they did not kill the job directly. HDFS itself also appears to be working fine.

Then I looked at the executor logs. They exited after the driver died and contain no errors or exceptions:

19/05/19 21:02:09 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver ip-10-8-0-247.us-west-2.compute.internal:11269 disassociated! Shutting down.

I could not find the cause there either, so I checked the YARN ResourceManager logs and found the following messages:

2019-05-19 18:36:44,047 INFO  availability.MetricSinkWriteShardHostnameHashingStrategy (MetricSinkWriteShardHostnameHashingStrategy.java:findCollectorShard(42)) - Calculated collector shard ip-10-8-0-145.us-west-2.compute.internal based on hostname: ip-10-8-0-145.us-west-2.compute.internal
2019-05-19 19:48:04,041 INFO  availability.MetricSinkWriteShardHostnameHashingStrategy (MetricSinkWriteShardHostnameHashingStrategy.java:findCollectorShard(42)) - Calculated collector shard ip-10-8-0-145.us-west-2.compute.internal based on hostname: ip-10-8-0-145.us-west-2.compute.internal
2019-05-19 21:02:08,797 INFO  rmcontainer.RMContainerImpl (RMContainerImpl.java:handle(422)) - container_e01_1557249464624_0669_01_000001 Container Transitioned from RUNNING to COMPLETED
2019-05-19 21:02:08,797 INFO  scheduler.SchedulerNode (SchedulerNode.java:releaseContainer(220)) - Released container container_e01_1557249464624_0669_01_000001 of capacity <memory:1024, vCores:1> on host ip-10-8-0-247.us-west-2.compute.internal:45454, which currently has 7 containers, <memory:19968, vCores:7> used and <memory:2560, vCores:1> available, release resources=true
2019-05-19 21:02:08,798 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:rememberTargetTransitionsAndStoreState(1209)) - Updating application attempt appattempt_1557249464624_0669_000001 with final state: FAILED, and exit status: -104
2019-05-19 21:02:08,798 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1557249464624_0669_000001 State change from RUNNING to FINAL_SAVING
2019-05-19 21:02:08,798 INFO  integration.RMRegistryOperationsService (RMRegistryOperationsService.java:onContainerFinished(143)) - Container container_e01_1557249464624_0669_01_000001 finished, skipping purging container-level records (should be handled by AM)
2019-05-19 21:02:08,801 INFO  resourcemanager.ApplicationMasterService (ApplicationMasterService.java:unregisterAttempt(685)) - Unregistering app attempt : appattempt_1557249464624_0669_000001
2019-05-19 21:02:08,801 INFO  security.AMRMTokenSecretManager (AMRMTokenSecretManager.java:applicationMasterFinished(124)) - Application finished, removing password for appattempt_1557249464624_0669_000001
2019-05-19 21:02:08,801 INFO  attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(809)) - appattempt_1557249464624_0669_000001 State change from FINAL_SAVING to FAILED
2019-05-19 21:02:08,801 INFO  rmapp.RMAppImpl (RMAppImpl.java:transition(1331)) - The number of failed attempts is 1. The max attempts is 2
2019-05-19 21:02:08,801 INFO  rmapp.RMAppImpl (RMAppImpl.java:handle(779)) - application_1557249464624_0669 State change from RUNNING to ACCEPTED
2019-05-19 21:02:08,801 INFO  capacity.CapacityScheduler (CapacityScheduler.java:doneApplicationAttempt(812)) - Application Attempt appattempt_1557249464624_0669_000001 is done. finalState=FAILED

It doesn't look like YARN killed the job either. The driver container just abruptly transitioned from RUNNING to COMPLETED.

I would expect to see some explicit message, like an OOM, causing the job to crash, but right now I am puzzled as to why it exits silently. Does this have anything to do with the HDFS errors? Is there any mechanism in Spark that silently stops the driver when there are too many exceptions, even if they are not fatal? Any input is welcome, thanks!

2 Answers:

Answer 0 (score: 0)

YARN exit code -104 means that the physical memory limit for that YARN container was exceeded:

The container was terminated because it went over its allocated physical memory limit.

When running on AWS, you can use an instance type with more RAM for the driver node.
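In practice that also means asking YARN to allocate a bigger driver container. A sketch of the usual Spark-side knobs (values are placeholders, and on older Spark 2.x releases the overhead key is spark.yarn.driver.memoryOverhead), passed via --conf on spark-submit or set in spark-defaults.conf:

spark.driver.memory=6g
spark.driver.memoryOverhead=1g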

Answer 1 (score: 0)

Please see the details in the reference below:

Ref: Bad DataNode Failure Issue Hortonworks-

Cause: This issue occurs when we run jobs on a small cluster (fewer than 5 data nodes) under heavy data load. If there is a datanode/network failure in the write pipeline, DFSClient tries to remove the failed datanode from the pipeline and then continues writing with the remaining datanodes. As a result, the number of datanodes in the pipeline shrinks. The properties described below help us resolve the issue.

Solution: Change the DataNode replacement policy as follows.

To resolve this issue, set the following two properties from Ambari > HDFS > Configs > Custom hdfs-site > Add Property:

dfs.client.block.write.replace-datanode-on-failure.enable=NEVER
dfs.client.block.write.replace-datanode-on-failure.policy=NEVER
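Since the failing write in the question comes from the Spark application's own HDFS client (the EventLoggingListener writing the event log), the same client-side settings can also be handed to the job itself through Spark's spark.hadoop.* passthrough, via --conf on spark-submit or in spark-defaults.conf. A sketch mirroring the two properties above:

spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.enable=NEVER
spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.policy=NEVER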