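I am deploying a Spark cluster with 1 master node and 3 worker nodes. As soon as the master and the workers are deployed, the master starts spamming its log with the following messages: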
19/07/17 12:56:51 INFO Master: I have been elected leader! New state: ALIVE
19/07/17 12:56:56 INFO Master: Registering worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:57 INFO Master: 172.26.140.163:59146 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.132:56252 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.194:62135 got disassociated, removing it.
19/07/17 12:57:02 INFO Master: Registering worker 172.26.140.169:44249 with 1 cores, 2.0 GB RAM
19/07/17 12:57:02 INFO Master: 172.26.140.163:59202 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.132:56355 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.194:62157 got disassociated, removing it.
19/07/17 12:57:07 INFO Master: 172.26.140.163:59266 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: 172.26.140.132:56376 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: Registering worker 172.26.140.204:43921 with 1 cores, 2.0 GB RAM
19/07/17 12:57:08 INFO Master: 172.26.140.194:62203 got disassociated, removing it.
19/07/17 12:57:12 INFO Master: 172.26.140.163:59342 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.132:56392 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.194:62268 got disassociated, removing it.
19/07/17 12:57:17 INFO Master: 172.26.140.163:59417 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.132:56415 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.194:62296 got disassociated, removing it.
19/07/17 12:57:22 INFO Master: 172.26.140.163:59472 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.132:56483 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.194:62323 got disassociated, removing it.
The worker nodes appear to connect to the master correctly and log the following:
19/07/17 12:56:56 INFO Utils: Successfully started service 'sparkWorker' on port 35803.
19/07/17 12:56:56 INFO Worker: Starting Spark worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:56 INFO Worker: Running Spark version 2.4.3
19/07/17 12:56:56 INFO Worker: Spark home: /opt/spark
19/07/17 12:56:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/07/17 12:56:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-worker-0.spark-worker-service.default.svc.cluster.local:8081
19/07/17 12:56:56 INFO Worker: Connecting to master spark-master-service.default.svc.cluster.local:7077...
19/07/17 12:56:56 INFO TransportClientFactory: Successfully created connection to spark-master-service.default.svc.cluster.local/10.0.179.236:7077 after 49 ms (0 ms spent in bootstraps)
19/07/17 12:56:56 INFO Worker: Successfully registered with master spark://172.26.140.196:7077
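However, the master keeps logging the "got disassociated" message for three separate nodes every 5 seconds.
Oddly, the IP addresses listed in the master's log all belong to the kube-proxy pods: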
kube-system kube-proxy-5vp9r 1/1 Running 0 39h 172.26.140.163 aks-agentpool-31454219-2 <none> <none>
kube-system kube-proxy-kl695 1/1 Running 0 39h 172.26.140.132 aks-agentpool-31454219-1 <none> <none>
kube-system kube-proxy-xgjws 1/1 Running 0 39h 172.26.140.194 aks-agentpool-31454219-0 <none> <none>
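The match between the two listings can be checked mechanically. Below is a minimal sketch, not part of the deployment itself, that tallies the "got disassociated" lines per source IP and flags those that belong to kube-proxy. The function name `count_disassociations` and the hard-coded sample lines are illustrative assumptions; the IPs are copied from the output above.

```python
import re
from collections import Counter

# Matches master log lines of the form "<ip>:<port> got disassociated, removing it."
DISASSOCIATED = re.compile(r"INFO Master: (\d+\.\d+\.\d+\.\d+):\d+ got disassociated")


def count_disassociations(log_lines):
    """Return a Counter mapping source IP -> number of disassociation messages."""
    counts = Counter()
    for line in log_lines:
        match = DISASSOCIATED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the master log above.
sample = [
    "19/07/17 12:56:56 INFO Master: Registering worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM",
    "19/07/17 12:56:57 INFO Master: 172.26.140.163:59146 got disassociated, removing it.",
    "19/07/17 12:56:58 INFO Master: 172.26.140.132:56252 got disassociated, removing it.",
    "19/07/17 12:56:58 INFO Master: 172.26.140.194:62135 got disassociated, removing it.",
    "19/07/17 12:57:02 INFO Master: 172.26.140.163:59202 got disassociated, removing it.",
]

# Pod IPs of the three kube-proxy pods, taken from the kubectl listing above.
kube_proxy_ips = {"172.26.140.163", "172.26.140.132", "172.26.140.194"}

counts = count_disassociations(sample)
for ip, n in sorted(counts.items()):
    print(ip, n, "kube-proxy" if ip in kube_proxy_ips else "other")
```

On the full log, every disassociated address falls in the kube-proxy set, while the registered workers (172.26.140.209, .169, .204) never appear in it.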
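My question is twofold:
1) Why are the kube-proxy pods connecting to the master? Or why does the master think the kube-proxy pods are taking part in this cluster?
2) What setting do I need to change to get rid of this message in the log files?
Here are the contents of my spark-defaults.conf file: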
spark.master=spark://spark-master-service:7077
spark.submit.deploy-mode=cluster
spark.executor.cores=1
spark.driver.memory=500m
spark.executor.memory=500m
spark.eventLog.enabled=true
spark.eventLog.dir=/mnt/eventLog
I cannot find any meaningful reason why this is happening; any help would be appreciated.
Answer 0 (score: 0)
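I ran into the same problem with a Spark cluster on Kubernetes, tested with Spark 2.4.3 and Spark 2.4.4 and Kubernetes 16.0 and 13.0.
Here is the solution.
This is how I created the Spark session at first: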
spark = SparkSession.builder.appName('Kubernetes-Spark-app').getOrCreate()
The problem went away once I used the cluster IP of the Spark master instead:
spark = SparkSession.builder.master('spark://10.0.106.83:7077').appName('Kubernetes-Spark-app').getOrCreate()
I used this chart:
helm install microsoft/spark --generate-name
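Hard-coding the cluster IP breaks whenever the service is recreated. As a hedged alternative, Kubernetes injects `<SERVICE_NAME>_SERVICE_HOST` environment variables (service name upper-cased, dashes replaced by underscores) into pods started after the service exists in the same namespace. The sketch below builds the master URL from that variable. The helper name `master_url_from_env` is hypothetical, and the service name `spark-master-service` is borrowed from the question; the name created by the Helm chart may differ.

```python
import os


def master_url_from_env(service_name, environ=None, port=7077):
    """Build a spark:// URL from the cluster IP Kubernetes exposes as an env var.

    Assumes the code runs in a pod that started after the service was created,
    in the same namespace, so that <NAME>_SERVICE_HOST is set.
    """
    if environ is None:
        environ = os.environ
    # Kubernetes naming rule: upper-case the service name, dashes become underscores.
    key = service_name.upper().replace("-", "_") + "_SERVICE_HOST"
    host = environ.get(key)
    if host is None:
        raise KeyError("environment variable %s is not set" % key)
    return "spark://%s:%d" % (host, port)


# Illustration with a fake environment; the IP is the one used in the answer above.
fake_env = {"SPARK_MASTER_SERVICE_SERVICE_HOST": "10.0.106.83"}
url = master_url_from_env("spark-master-service", environ=fake_env)
print(url)  # spark://10.0.106.83:7077
```

The result can then be passed to `SparkSession.builder.master(url)` in place of the literal address. The port defaults to 7077, the master port shown in the logs; pass a different value if the master listens elsewhere.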