I have a Spark (DataStax Enterprise) cluster created with Docker and bundled together using docker-compose. This is purely for local development. The containers sit on their own docker network, 172.18.0.0/16. I'm on a Mac running Docker Toolbox, and I can reach the containers directly from my machine because I've manually added a route to 172.18.0.0/16 via vboxnet0, the virtual network VirtualBox provides on the Mac. My side of the vboxnet0 interface has IP 192.168.99.1; the docker-machine side has 192.168.99.101.
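For reference, the route was added roughly like this (a sketch; the exact BSD route flags may differ by macOS version):

# Send the containers' subnet via the docker-machine VM on vboxnet0
sudo route -n add 172.18.0.0/16 192.168.99.101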
This all works fine: the master web UI comes up at 172.18.0.2:7080 and all nodes show up correctly with 172.x IP addresses (and continue to do so if I scale up, e.g. docker-compose scale spark=5).
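For context, the network corresponds to a compose file along these lines (a hypothetical excerpt; my actual docker-compose.yml isn't shown here):

# Hypothetical docker-compose.yml fragment pinning the containers' subnet
networks:
  default:
    ipam:
      config:
        - subnet: 172.18.0.0/16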
However, when I submit a job, e.g.:
$SPARK_HOME/bin/spark-submit --master spark://172.18.0.2:7077 --class myapp.Main \
./target/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar
it's slow (I assume due to retries) and I see errors like this until it eventually succeeds:
16/09/16 13:01:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 5, 192.168.99.101): org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 locations. Most recent failure cause:
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at com.datastax.spark.connector.rdd.CassandraJoinRDD.compute(CassandraJoinRDD.scala:224)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /192.168.99.101:35306
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: /192.168.99.101:35306
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
... 1 more
However, I don't know why it is trying to reach resources via the gateway address 192.168.99.101 at all.

I also see output like this, which again doesn't show the IP addresses I'd expect:
16/09/16 13:01:36 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, 192.168.99.101, partition 1,PROCESS_LOCAL, 2316 bytes)
16/09/16 13:01:36 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 5, 192.168.99.101, partition 0,NODE_LOCAL, 2089 bytes)
16/09/16 13:01:36 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.99.101:39885 (size: 7.5 KB, free: 511.1 MB)
16/09/16 13:01:51 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.99.101:35306 (size: 7.5 KB, free: 511.0 MB)
I've tried setting SPARK_LOCAL_IP=192.168.99.1 on the Mac, and setting each node's own 172.x address in its spark-env.sh, but neither helped.
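Concretely, the attempts looked like this (a sketch; 172.18.0.2 stands in for each node's own address):

# On the Mac, before submitting:
export SPARK_LOCAL_IP=192.168.99.1

# In each node's conf/spark-env.sh, that node's own container address, e.g.:
SPARK_LOCAL_IP=172.18.0.2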
I can reach all the nodes directly from the Mac, and they in turn can route to the docker-machine VM (192.168.99.101) and to the Mac (192.168.99.1). Each node can also route to every other node, both by name and by IP address.
Am I right in thinking that the BlockManager is picking up the wrong IP address? Is there a way to force it to use the correct one instead of the gateway address it somehow ends up with?
Edit: just to add, I've also tried pinning the block manager to a hardcoded port, since the randomly assigned ports obviously won't match anything I can EXPOSE regardless of whether the IP address is right, but it seemed to have no effect. I did this by setting -Dspark.blockManager.port=7005 in SPARK_MASTER_OPTS and SPARK_WORKER_OPTS.
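In spark-env.sh terms, that amounts to (on every node):

# Pin the block manager port on both master and workers
SPARK_MASTER_OPTS="-Dspark.blockManager.port=7005"
SPARK_WORKER_OPTS="-Dspark.blockManager.port=7005"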
Edit 2: if I put Java, Spark and my app jar onto a fresh, empty container (on the same 172.18/16 network) and launch the job from there via spark-submit (i.e. no traffic through the gateway, only container-to-container), everything works as expected. So this looks like some issue with the host's gateway IP being picked up when submitting from the other side of the gateway.
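For what it's worth, the working container-to-container submission was along these lines (a sketch; the network and image names are placeholders, not my real ones):

# Submit from a throwaway container attached to the same compose network
docker run --rm -it --net=myproject_default -v $(pwd)/target:/target my-spark-client \
  spark-submit --master spark://172.18.0.2:7077 --class myapp.Main \
  /target/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar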