I have a Spark Structured Streaming application that consumes messages from a Kafka topic. The application works fine in local mode (master = local). Now I want to run it in cluster mode on a Kubernetes cluster in Amazon EKS (Kubernetes version 1.11). For the deployment we decided to try both the Kubernetes Spark Operator and plain spark-submit. In both cases the application deploys without problems, but as soon as the first Kafka message is consumed we get the error:
BindException: Address not available: Service 'sparkDriver' failed
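For context, the application itself follows the usual Structured Streaming read-from-Kafka pattern. The sketch below is a minimal, hypothetical reconstruction, not the real job: the bootstrap servers, topic name, sink, and checkpoint location are placeholders.

import org.apache.spark.sql.SparkSession

object TransducerApp {
  def main(args: Array[String]): Unit = {
    // The session is created once in the driver; master and deploy mode come from spark-submit / the operator
    val spark = SparkSession.builder()
      .appName("mammut-transducers-ktt")
      .getOrCreate()

    // Subscribe to the Kafka topic (placeholder broker and topic names)
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "some-topic")
      .load()

    // Write the raw payload to the console just to keep the stream running
    val query = messages
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints")
      .start()

    query.awaitTermination()
  }
}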
The executor log is as follows:
2019-02-14 23:02:29 INFO Executor:54 - Adding file:/opt/spark/work-dir/./mammut-transducer-ktt.jar to class loader
2019-02-14 23:02:29 INFO TorrentBroadcast:54 - Started reading broadcast variable 0
2019-02-14 23:02:29 INFO TransportClientFactory:267 - Successfully created connection to spark-pi-1550185196340-driver-svc.default.svc/192.168.125.76:7279 after 2 ms (0 ms spent in bootstraps)
2019-02-14 23:02:30 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 7.3 KB, free 997.8 MB)
2019-02-14 23:02:30 INFO TorrentBroadcast:54 - Reading broadcast variable 0 took 161 ms
2019-02-14 23:02:30 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 17.4 KB, free 997.8 MB)
2019-02-14 23:02:30 INFO ConsumerConfig:279 - ConsumerConfig values:
auto.commit.interval.ms = 5000
...
2019-02-14 23:02:30 INFO AppInfoParser:109 - Kafka version : 2.0.0
2019-02-14 23:02:30 INFO AppInfoParser:110 - Kafka commitId : 3402a8361b734732
2019-02-14 23:02:31 INFO CodeGenerator:54 - Code generated in 582.151004 ms
2019-02-14 23:02:31 INFO CodeGenerator:54 - Code generated in 13.579402 ms
2019-02-14 23:02:31 INFO Metadata:273 - Cluster ID: K0XS-CasSt6MVY0r9NqCjg
2019-02-14 23:02:32 INFO SparkContext:54 - Running Spark version 2.4.0
2019-02-14 23:02:32 INFO SparkContext:54 - Submitted application: mammut-transducers-ktt
2019-02-14 23:02:32 INFO SecurityManager:54 - Changing view acls to: root
2019-02-14 23:02:32 INFO SecurityManager:54 - Changing modify acls to: root
2019-02-14 23:02:32 INFO SecurityManager:54 - Changing view acls groups to:
2019-02-14 23:02:32 INFO SecurityManager:54 - Changing modify acls groups to:
2019-02-14 23:02:32 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-02-14 23:02:32 WARN Utils:66 - Service 'sparkDriver' could not bind on port 7....
.
.
2019-02-14 23:02:32 ERROR SparkContext:91 - Error initializing SparkContext.
java.net.BindException: Address not available: Service 'sparkDriver' failed after 32 retries (starting from 7380)! Consider explicitly setting the appropriate port for the service 'sparkDriver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
The driver log is as follows:
2019-02-14 23:02:27 INFO DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[5] at start at InterpreterMainService.scala:80) (first 15 tasks are for partitions Vector(0))
2019-02-14 23:02:27 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 1 tasks
2019-02-14 23:02:28 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, 192.168.82.130, executor 1, partition 0, PROCESS_LOCAL, 8803 bytes)
2019-02-14 23:02:30 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 192.168.82.130:7178 (size: 7.3 KB, free: 997.8 MB)
2019-02-14 23:02:32 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, 192.168.82.130, executor 1): java.net.BindException: Address not available: Service 'sparkDriver' failed after 32 retries (starting from 7380)! Consider explicitly setting the appropriate port for the service 'sparkDriver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1283)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
2019-02-14 23:02:32 INFO TaskSetManager:54 - Starting task 0.1 in stage 0.0 (TID 1, 192.168.82.130, executor 1, partition 0, PROCESS_LOCAL, 8803 bytes)
I have tried changing the different ports that Spark uses, and placing the executor pod and the driver pod on different Kubernetes nodes. I have also tried both the default Kubernetes namespace and a new namespace. All of these changes produce exactly the same result.
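The port changes were made through the standard Spark properties (spark.driver.port, spark.driver.blockManager.port, spark.blockManager.port, spark.port.maxRetries). A rough sketch of what that looks like is below, with example values rather than the exact ones from the failing run; in practice the same settings were passed as --conf options to spark-submit or in the operator's sparkConf section.

import org.apache.spark.sql.SparkSession

// Example values only; equivalent to passing --conf spark.driver.port=7380 etc. at submit time
val spark = SparkSession.builder()
  .appName("mammut-transducers-ktt")
  .config("spark.driver.port", "7380")               // port the 'sparkDriver' service binds to
  .config("spark.driver.blockManager.port", "7279")  // driver-side block manager port
  .config("spark.blockManager.port", "7178")         // executor-side block manager port
  .config("spark.port.maxRetries", "32")             // how many successive ports to try before failing
  .getOrCreate()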
I have also tried running the examples from the Spark Operator documentation and the Spark documentation, and those simple Spark applications work fine. However, none of them are Spark Streaming applications.
In Kubernetes, the driver pod has the following environment variable:
SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP)
And the executor pod has:
SPARK_DRIVER_URL: spark://CoarseGrainedScheduler@spark-pi-1550185196340-driver-svc.default.svc:7380
The Spark application UI shows the following configuration:
spark.blockManager.port 7178
spark.driver.bindAddress 192.168.125.76
spark.driver.blockManager.port 7279
spark.driver.host spark-pi-1550185196340-driver-svc.default.svc
spark.driver.port 7380
spark.master k8s://https://10.100.0.1:443
In the executor log we can see that the executor does communicate with the driver:
2019-02-14 23:02:28 INFO Executor:54 - Fetching spark://spark-pi-1550185196340-driver-svc.default.svc:7380/jars/mammut-transducer-ktt.jar with timestamp 1550185238992
2019-02-14 23:02:28 INFO TransportClientFactory:267 - Successfully created connection to spark-pi-1550185196340-driver-svc.default.svc/192.168.125.76:7380 after 1 ms (0 ms spent in bootstraps)
Does anyone have any idea why the executor cannot connect to spark.driver.port once the streaming process actually starts?