Kubernetes + Spark: how to resolve "BindException: Address not available: Service 'sparkDriver' failed"?

Asked: 2019-02-14 23:54:17

Tags: apache-spark kubernetes kubernetes-helm amazon-eks aws-eks

I have a Spark Structured Streaming application that consumes messages from a Kafka topic. The application works fine in local mode (master = local). Now I want to run it in cluster mode on a Kubernetes cluster in Amazon EKS (Kubernetes version 1.11). For deployment we decided to try both the Kubernetes Spark Operator and plain spark-submit. In both cases the application deploys without problems, but as soon as the first Kafka message is consumed we get the error:

BindException: Address not available: Service 'sparkDriver' failed
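For reference, a cluster-mode submission to EKS of the kind described would look roughly like this (the image name, main class, and jar path below are placeholders, not the actual values used):

```shell
# Sketch of a Spark 2.4 spark-submit invocation against a Kubernetes master.
# Image, class name, and registry are hypothetical placeholders.
bin/spark-submit \
  --master k8s://https://10.100.0.1:443 \
  --deploy-mode cluster \
  --name mammut-transducers-ktt \
  --class com.example.Main \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=<registry>/spark-app:2.4.0 \
  local:///opt/spark/work-dir/mammut-transducer-ktt.jar
```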

The logs in the executor look like this:

2019-02-14 23:02:29 INFO  Executor:54 - Adding file:/opt/spark/work-dir/./mammut-transducer-ktt.jar to class loader
2019-02-14 23:02:29 INFO  TorrentBroadcast:54 - Started reading broadcast variable 0
2019-02-14 23:02:29 INFO  TransportClientFactory:267 - Successfully created connection to spark-pi-1550185196340-driver-svc.default.svc/192.168.125.76:7279 after 2 ms (0 ms spent in bootstraps)
2019-02-14 23:02:30 INFO  MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.3 KB, free 997.8 MB)
2019-02-14 23:02:30 INFO  TorrentBroadcast:54 - Reading broadcast variable 0 took 161 ms
2019-02-14 23:02:30 INFO  MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 17.4 KB, free 997.8 MB)
2019-02-14 23:02:30 INFO  ConsumerConfig:279 - ConsumerConfig values: 
    auto.commit.interval.ms = 5000
...
2019-02-14 23:02:30 INFO  AppInfoParser:109 - Kafka version : 2.0.0
2019-02-14 23:02:30 INFO  AppInfoParser:110 - Kafka commitId : 3402a8361b734732
2019-02-14 23:02:31 INFO  CodeGenerator:54 - Code generated in 582.151004 ms
2019-02-14 23:02:31 INFO  CodeGenerator:54 - Code generated in 13.579402 ms
2019-02-14 23:02:31 INFO  Metadata:273 - Cluster ID: K0XS-CasSt6MVY0r9NqCjg
2019-02-14 23:02:32 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-02-14 23:02:32 INFO  SparkContext:54 - Submitted application: mammut-transducers-ktt
2019-02-14 23:02:32 INFO  SecurityManager:54 - Changing view acls to: root
2019-02-14 23:02:32 INFO  SecurityManager:54 - Changing modify acls to: root
2019-02-14 23:02:32 INFO  SecurityManager:54 - Changing view acls groups to: 
2019-02-14 23:02:32 INFO  SecurityManager:54 - Changing modify acls groups to: 
2019-02-14 23:02:32 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2019-02-14 23:02:32 WARN  Utils:66 - Service 'sparkDriver' could not bind on port 7....
...
2019-02-14 23:02:32 ERROR SparkContext:91 - Error initializing SparkContext.
java.net.BindException: Address not available: Service 'sparkDriver' failed after 32 retries (starting from 7380)! Consider explicitly setting the appropriate port for the service 'sparkDriver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
    at sun.nio.ch.Net.bind0(Native Method)

The logs in the driver look like this:

2019-02-14 23:02:27 INFO  DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at start at InterpreterMainService.scala:80) (first 15 tasks are for partitions Vector(0))
2019-02-14 23:02:27 INFO  TaskSchedulerImpl:54 - Adding task set 0.0 with 1 tasks
2019-02-14 23:02:28 INFO  TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 192.168.82.130, executor 1, partition 0, PROCESS_LOCAL, 8803 bytes)
2019-02-14 23:02:30 INFO  BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 192.168.82.130:7178 (size: 7.3 KB, free: 997.8 MB)
2019-02-14 23:02:32 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 192.168.82.130, executor 1): java.net.BindException: Address not available: Service 'sparkDriver' failed after 32 retries (starting from 7380)! Consider explicitly setting the appropriate port for the service 'sparkDriver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:433)
    at sun.nio.ch.Net.bind(Net.java:425)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
    at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
    at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
    at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
 2019-02-14 23:02:32 INFO  TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 1, 192.168.82.130, executor 1, partition 0, PROCESS_LOCAL, 8803 bytes)

I have tried changing the different ports that Spark uses, and placing the executor pod and the driver pod on different Kubernetes nodes. I have also tried running in the default Kubernetes namespace and in a new namespace. All of these changes produce exactly the same result.
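The port changes were of this kind, i.e. pinning the driver-side ports explicitly via --conf (the specific port numbers here are illustrative; all four properties are standard Spark network settings):

```shell
# Illustrative port overrides to append to the spark-submit invocation.
# Port values are examples, not the ones actually tried.
--conf spark.driver.port=7078 \
--conf spark.driver.blockManager.port=7079 \
--conf spark.blockManager.port=7080 \
--conf spark.port.maxRetries=64
```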

I have also tried running the examples from the operator documentation and from the Spark documentation, and those simple Spark applications work fine. But none of them are Spark streaming applications.
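For the Spark Operator path, the deployment would use a SparkApplication manifest roughly of this shape (names, image, paths, and resource values below are hypothetical placeholders; the schema is the operator's v1beta1 API):

```shell
# Sketch of a SparkApplication manifest for the Kubernetes Spark Operator.
# All metadata, image, and path values are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: sparkoperator.k8s.io/v1beta1
kind: SparkApplication
metadata:
  name: mammut-transducers-ktt
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: <registry>/spark-app:2.4.0
  mainClass: com.example.Main
  mainApplicationFile: local:///opt/spark/work-dir/mammut-transducer-ktt.jar
  sparkVersion: "2.4.0"
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark
  executor:
    instances: 1
    cores: 1
    memory: 1g
EOF
```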

In Kubernetes, the driver pod has the following env variable:

SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)

And the executor pods have:

SPARK_DRIVER_URL: spark://CoarseGrainedScheduler@spark-pi-1550185196340-driver-svc.default.svc:7380

The Spark application UI shows the following configuration:

spark.blockManager.port         7178
spark.driver.bindAddress    192.168.125.76
spark.driver.blockManager.port  7279
spark.driver.host           spark-pi-1550185196340-driver-svc.default.svc
spark.driver.port           7380
spark.master                    k8s://https://10.100.0.1:443

In the executor logs we can see the executor communicating with the driver:

2019-02-14 23:02:28 INFO  Executor:54 - Fetching spark://spark-pi-1550185196340-driver-svc.default.svc:7380/jars/mammut-transducer-ktt.jar with timestamp 1550185238992
2019-02-14 23:02:28 INFO  TransportClientFactory:267 - Successfully created connection to spark-pi-1550185196340-driver-svc.default.svc/192.168.125.76:7380 after 1 ms (0 ms spent in bootstraps)

Does anyone have an idea why the executor cannot connect to spark.driver.port at the moment the streaming process actually starts?

0 answers:

There are no answers yet.