Spark Master: got disassociated, removing it

Asked: 2019-07-17 14:36:26

Tags: apache-spark kubernetes

I am deploying a Spark cluster with 1 master and 3 worker nodes. After the master and workers are deployed, the master starts spamming the following messages:

19/07/17 12:56:51 INFO Master: I have been elected leader! New state: ALIVE
19/07/17 12:56:56 INFO Master: Registering worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:57 INFO Master: 172.26.140.163:59146 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.132:56252 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.194:62135 got disassociated, removing it.
19/07/17 12:57:02 INFO Master: Registering worker 172.26.140.169:44249 with 1 cores, 2.0 GB RAM
19/07/17 12:57:02 INFO Master: 172.26.140.163:59202 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.132:56355 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.194:62157 got disassociated, removing it.
19/07/17 12:57:07 INFO Master: 172.26.140.163:59266 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: 172.26.140.132:56376 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: Registering worker 172.26.140.204:43921 with 1 cores, 2.0 GB RAM
19/07/17 12:57:08 INFO Master: 172.26.140.194:62203 got disassociated, removing it.
19/07/17 12:57:12 INFO Master: 172.26.140.163:59342 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.132:56392 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.194:62268 got disassociated, removing it.
19/07/17 12:57:17 INFO Master: 172.26.140.163:59417 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.132:56415 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.194:62296 got disassociated, removing it.
19/07/17 12:57:22 INFO Master: 172.26.140.163:59472 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.132:56483 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.194:62323 got disassociated, removing it.

The workers appear to have connected to the master correctly and are logging the following:

19/07/17 12:56:56 INFO Utils: Successfully started service 'sparkWorker' on port 35803.
19/07/17 12:56:56 INFO Worker: Starting Spark worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:56 INFO Worker: Running Spark version 2.4.3
19/07/17 12:56:56 INFO Worker: Spark home: /opt/spark
19/07/17 12:56:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/07/17 12:56:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-worker-0.spark-worker-service.default.svc.cluster.local:8081
19/07/17 12:56:56 INFO Worker: Connecting to master spark-master-service.default.svc.cluster.local:7077...
19/07/17 12:56:56 INFO TransportClientFactory: Successfully created connection to spark-master-service.default.svc.cluster.local/10.0.179.236:7077 after 49 ms (0 ms spent in bootstraps)
19/07/17 12:56:56 INFO Worker: Successfully registered with master spark://172.26.140.196:7077

But the master keeps logging the disassociation messages for the three separate nodes every 5 seconds.

Strangely, the IP addresses listed in the master's log all belong to the kube-proxy pods:

kube-system   kube-proxy-5vp9r                                     1/1     Running            0          39h     172.26.140.163   aks-agentpool-31454219-2   <none>           <none>
kube-system   kube-proxy-kl695                                     1/1     Running            0          39h     172.26.140.132   aks-agentpool-31454219-1   <none>           <none>
kube-system   kube-proxy-xgjws                                     1/1     Running            0          39h     172.26.140.194   aks-agentpool-31454219-0   <none>           <none>
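
(For reference, the listing above is just a wide pod listing filtered for kube-proxy; something along these lines should reproduce it.)

# list all pods with their pod IPs and nodes, keeping only the kube-proxy entries
kubectl get pods --all-namespaces -o wide | grep kube-proxy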

My question is twofold:

1) Why are the kube-proxy pods connecting to the master? Or rather, why does the master think the kube-proxy pods are taking part in this cluster?

2) What setting do I need to change to get rid of these messages in the log files?

Here are the contents of my spark-defaults.conf file:

spark.master=spark://spark-master-service:7077
spark.submit.deploy-mode=cluster
spark.executor.cores=1
spark.driver.memory=500m
spark.executor.memory=500m
spark.eventLog.enabled=true
spark.eventLog.dir=/mnt/eventLog
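
For context, the master and worker containers just run Spark's standalone daemons against this config; a rough sketch of the entrypoints (not my exact manifests; the Spark home and hostnames are taken from the logs above) would be:

# master container: start the standalone master (sketch)
/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master \
  --host 0.0.0.0 --port 7077 --webui-port 8080

# worker containers: register against the master Service (sketch)
/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
  spark://spark-master-service.default.svc.cluster.local:7077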

I can't find any meaningful explanation for why this is happening; any help would be greatly appreciated.

1 Answer:

Answer 0 (score: 0)

I ran into the same problem with my Spark cluster on Kubernetes; I tested Spark 2.4.3 and Spark 2.4.4 on Kubernetes 16.0 and 13.0.

Here is the solution:

This is how I originally created the Spark session object:

spark = SparkSession.builder.appName('Kubernetes-Spark-app').getOrCreate()

Using the cluster IP of the Spark master instead solved the problem!

spark = SparkSession.builder.master('spark://10.0.106.83:7077').appName('Kubernetes-Spark-app').getOrCreate()
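
A note on finding that cluster IP: it is the clusterIP of the Spark master's Kubernetes Service. The Service name depends on how the cluster was deployed (the placeholder below is not a real name), so list the Services first:

# find the master Service, then print its cluster IP
kubectl get svc
kubectl get svc <master-service-name> -o jsonpath='{.spec.clusterIP}'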

I am using this Helm chart:

helm install microsoft/spark --generate-name
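
Once the chart is installed, the release and the pods/Services it created can be checked with the usual commands, for example:

# show the Helm release and the Spark pods/Services it created
helm list
kubectl get pods,svc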