DataStax Spark jobs killed for no apparent reason

Date: 2016-03-28 21:47:12

Tags: apache-spark datastax

We are using DSE Spark with a 3-node cluster running 5 jobs. We see a SIGTERM command show up in /var/log/spark/worker/worker-0/worker.log, and it is stopping our jobs. We do not see any corresponding memory or processor limits being hit at those times, and nobody is issuing these calls manually.

I have seen several similar issues that came down to heap-size problems with YARN or Mesos, but since we are using DSE, those do not seem relevant.

Here is a sample of the log messages from one server that was running 2 jobs:

ERROR [SIGTERM handler] 2016-03-26 00:43:28,780 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
ERROR [SIGHUP handler] 2016-03-26 00:43:28,788 SignalLogger.scala:57 - RECEIVED SIGNAL 1: SIGHUP
INFO  [Spark Shutdown Hook] 2016-03-26 00:43:28,795 Logging.scala:59 - Killing process!
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr] 2016-03-26 00:43:28,848 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stderr
java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
        at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
        at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [File appending thread for /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout] 2016-03-26 00:43:28,892 Logging.scala:96 - Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout
java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) ~[na:1.8.0_71]
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:283) ~[na:1.8.0_71]
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_71]
        at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_71]
        at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ~[spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
        at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) [spark-core_2.10-1.4.1.3.jar:1.4.1.3]
ERROR [SIGTERM handler] 2016-03-26 00:43:29,070 SignalLogger.scala:57 - RECEIVED SIGNAL 15: SIGTERM
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,079 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker@10.0.1.7:44131] -> [akka.tcp://sparkMaster@10.0.1.7:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,080 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,081 Logging.scala:59 - Connecting to master akka.tcp://sparkMaster@10.0.1.7:7077/user/Master...
WARN  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,091 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkMaster@10.0.1.7:7077] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,101 Logging.scala:59 - Disassociated [akka.tcp://sparkWorker@10.0.1.7:44131] -> [akka.tcp://sparkMaster@10.0.1.7:7077] Disassociated !
ERROR [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:75 - Connection to master failed! Waiting for master to reconnect...
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,102 Logging.scala:59 - Not spawning another attempt to register with the master, since there is an attempt scheduled already.
WARN  [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,323 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor@10.0.1.7:49943] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,330 Logging.scala:59 - Executor app-20160325132151-0004/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO  [Spark Shutdown Hook] 2016-03-26 00:43:29,414 Logging.scala:59 - Killing process!
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,415 Logging.scala:59 - Executor app-20160325131848-0001/0 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO  [Spark Shutdown Hook] 2016-03-26 00:43:29,417 Logging.scala:59 - Killing process!
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,422 Logging.scala:59 - Unknown Executor app-20160325132151-0004/0 finished with state EXITED message Worker shutting down exitStatus 129
WARN  [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,425 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor@10.0.1.7:32874] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
WARN  [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,433 Slf4jLogger.scala:71 - Association with remote system [akka.tcp://sparkExecutor@10.0.1.7:56212] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
INFO  [sparkWorker-akka.actor.default-dispatcher-3] 2016-03-26 00:43:29,441 Logging.scala:59 - Executor app-20160325131918-0002/1 finished with state EXITED message Command exited with code 129 exitStatus 129
INFO  [sparkWorker-akka.actor.default-dispatcher-4] 2016-03-26 00:43:29,448 Logging.scala:59 - Unknown Executor app-20160325131918-0002/1 finished with state EXITED message Worker shutting down exitStatus 129
INFO  [Spark Shutdown Hook] 2016-03-26 00:43:29,448 Logging.scala:59 - Shutdown hook called
INFO  [Spark Shutdown Hook] 2016-03-26 00:43:29,449 Logging.scala:59 - Deleting directory /var/lib/spark/rdd/spark-28fa2f73-d2aa-44c0-ad4e-3ccfd07a95d2

1 Answer:

Answer 0 (score: 0)

The error seems pretty straightforward to me:

Error writing stream to file /var/lib/spark/worker/worker-0/app-20160325131848-0001/0/stdout
java.io.IOException: Stream closed
        at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)

What you have here is a network problem between your data source (Cassandra) and Spark. Remember that Spark on node1 can and will pull data from Cassandra's node2, even though it tries to minimize that.
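If remote reads are the suspect, one standard knob is spark.locality.wait, which controls how long the scheduler waits for a data-local slot before running a task on a remote node. A minimal sketch, assuming a programmatic SparkConf; the 10000 ms value is illustrative, not a tested recommendation:

import org.apache.spark.SparkConf

// Sketch only: lengthen the wait for a node-local slot so tasks run
// more often on the node that holds the Cassandra data. Value in ms;
// 10000 is an illustrative choice, not a recommendation from the answer.
val conf = new SparkConf()
  .set("spark.locality.wait", "10000")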

Alternatively, there is a problem with your serialization, so add this parameter to your Spark configuration to switch to Kryo:

spark.serializer=org.apache.spark.serializer.KryoSerializer
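The same switch can be made programmatically when building the context. A minimal sketch against the Spark 1.4 Scala API; MyRecord and the app name are hypothetical placeholders, not anything from the question:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical placeholder for a class your jobs serialize heavily.
case class MyRecord(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("dse-spark-job") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: pre-register frequently serialized classes with Kryo
  // so their class names are not written into every record.
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)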