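I set up a cluster on 12 machines, and the Spark workers on the slave machines become disassociated from the master every day. They appear to work normally for a while during the day, but then all the slaves get disassociated from the master and are shut down.
The worker log looks like this: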
16/03/07 12:45:34.828 INFO Worker: Retrying connection to master (attempt # 1)
16/03/07 12:45:34.830 INFO Worker: Connecting to master host1:7077...
16/03/07 12:45:34.826 INFO Worker: Retrying connection to master (attempt # 2)
16/03/07 12:45:45.830 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-2,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@1c5651e9 rejected from java.util.concurrent.ThreadPoolExecutor@671ba687[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 2]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
...
16/03/07 12:45:45.853 Info ExecutorRunner: Killing process!
16/03/07 12:45:45.855 INFO ShutdownHookManager: Shutdown hook called
The master log looks like this:
16/03/07 12:45:45.878 INFO Master: 10.126.217.11:51502 got disassociated, removing it.
16/03/07 12:45:45.878 INFO Master: Removing worker worker-20160303035822-10.126.217.11-51502 on 10.126.217.11:51502
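Machine info:
40 cores and 256 GB of memory per machine
Spark version: 1.5.1
Java version: 1.8.0_45
The Spark application runs on this cluster with the following configuration: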
spark.cores.max=360
spark.executor.memory=32g
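As a rough sanity check of these settings against the hardware, here is a minimal sketch. It assumes that all 12 machines run a worker, which may not hold if one machine is dedicated to the master, and the function name `capacity_check` is only illustrative. It compares the requested totals with what the machines provide and says nothing about the actual cause of the disassociation.

```python
# Hardware figures taken from the machine info above.
MACHINES = 12              # assumption: every machine runs a worker
CORES_PER_MACHINE = 40
MEM_PER_MACHINE_GB = 256

# Values from the Spark configuration above.
CORES_MAX = 360            # spark.cores.max
EXECUTOR_MEM_GB = 32       # spark.executor.memory


def capacity_check(machines, cores_per_machine, mem_per_machine_gb,
                   cores_max, executor_mem_gb):
    """Compare the requested Spark resources with the cluster hardware."""
    total_cores = machines * cores_per_machine
    return {
        "total_cores": total_cores,
        "cores_within_limit": cores_max <= total_cores,
        "executor_fits_in_machine": executor_mem_gb <= mem_per_machine_gb,
    }


result = capacity_check(MACHINES, CORES_PER_MACHINE, MEM_PER_MACHINE_GB,
                        CORES_MAX, EXECUTOR_MEM_GB)
print(result)
```

Under that assumption the requested cores and executor memory stay within the hardware limits, so the configuration alone does not obviously oversubscribe the machines.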
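Is this a memory problem on the slaves or on the master?
Or is it a network problem between the slaves and the master?
Or is it something else entirely?
Please advise.
Thanks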