Spark nodes communicating on the wrong IP address (Docker)

Date: 2016-09-16 12:15:08

Tags: apache-spark docker docker-compose datastax-enterprise

I have a Spark (DataStax Enterprise) cluster created with Docker and tied together with docker-compose. It is used for local development only.

The containers sit on their own Docker network, 172.18.0.0/16. I am on a Mac running Docker Toolbox, and I can reach the containers directly from my machine because I have manually added a route to 172.18.0.0/16 via vboxnet0, the host-only network VirtualBox provides on the Mac.

My end of the vboxnet0 interface has IP 192.168.99.1; the docker-machine end has 192.168.99.101.
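For anyone reproducing the setup, the manual route on the Mac was added along these lines (a sketch; it assumes the docker-machine VM at 192.168.99.101 is the gateway, as described above):

```shell
# macOS: route the Docker bridge subnet through the docker-machine VM
# (192.168.99.101 is the VM's address on vboxnet0)
sudo route -n add -net 172.18.0.0/16 192.168.99.101

# Verify the route took effect
netstat -rn | grep 172.18
```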

This all works well: the master web UI comes up on 172.18.0.2:7080 and all nodes show up correctly with their 172.x IP addresses (and they keep doing so if I scale out, e.g. docker-compose scale spark=5).
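The container addressing can be double-checked from the host. The network name below is docker-compose's default naming convention and is a guess here:

```shell
# List each container on the compose network together with its 172.18.x address
# (replace "myproject_default" with your actual compose network name)
docker network inspect myproject_default | grep -E '"Name"|"IPv4Address"'
```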

However, when I submit a job, e.g.:

$SPARK_HOME/bin/spark-submit --master spark://172.18.0.2:7077 --class myapp.Main \
    ./target/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar

it is slow (I assume because of retries), and I see errors like this until it eventually succeeds:

16/09/16 13:01:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 5, 192.168.99.101): org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 locations. Most recent failure cause:
    at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
    at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
    at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
    at org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at com.datastax.spark.connector.rdd.CassandraJoinRDD.compute(CassandraJoinRDD.scala:224)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /192.168.99.101:35306
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    ... 3 more
Caused by: java.net.ConnectException: Connection refused: /192.168.99.101:35306
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
    ... 1 more

What I don't understand is why it tries to reach resources on the gateway address 192.168.99.101 at all.

I also see output like the following, which again doesn't show the IP addresses I would expect:

16/09/16 13:01:36 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 4, 192.168.99.101, partition 1,PROCESS_LOCAL, 2316 bytes)
16/09/16 13:01:36 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 5, 192.168.99.101, partition 0,NODE_LOCAL, 2089 bytes)
16/09/16 13:01:36 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.99.101:39885 (size: 7.5 KB, free: 511.1 MB)
16/09/16 13:01:51 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.99.101:35306 (size: 7.5 KB, free: 511.0 MB)

I have tried setting SPARK_LOCAL_IP=192.168.99.1 on the Mac, and setting it to each node's own 172.x address in spark-env.sh on the nodes, but it didn't help.
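Concretely, that attempt looked something like this (a sketch; the master's address is used as the example, and each worker would use its own 172.18.x address):

```shell
# On the Mac, before running spark-submit:
export SPARK_LOCAL_IP=192.168.99.1

# In $SPARK_HOME/conf/spark-env.sh inside each container,
# using that node's own address on the Docker network:
export SPARK_LOCAL_IP=172.18.0.2   # e.g. on the master
```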

I can reach all of the nodes directly from the Mac, and they can route back to both the docker-machine VM (192.168.99.101) and the Mac (192.168.99.1). Each node can also route to every other node, by name and by IP address.

Am I right in thinking the BlockManager is using the wrong IP address? Is there a way to force it to use the correct one instead of the gateway address it is somehow picking up?
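One related knob is spark.driver.host, which pins the address the driver advertises to the cluster. Shown here only as a sketch: it controls the driver side rather than the addresses executors register with the BlockManager, so it may not cover this case:

```shell
# Pin the address the driver advertises (the Mac's vboxnet0 address here)
$SPARK_HOME/bin/spark-submit \
    --master spark://172.18.0.2:7077 \
    --conf spark.driver.host=192.168.99.1 \
    --class myapp.Main \
    ./target/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar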

Edit: just to add, I also tried pinning the block manager to a hard-coded port, since a randomly assigned port obviously won't match anything I can EXPOSE, regardless of whether the IP address is right. I did this by setting -Dspark.blockManager.port=7005 in SPARK_MASTER_OPTS and SPARK_WORKER_OPTS, but it seemed to have no effect.

Edit 2: if I put Java, Spark and my app jar onto a fresh, empty container (on the same 172.18/16 network) and launch the job from there (i.e. no traffic crosses the gateway, only container-to-container), everything works as expected. So it looks like some issue with the host's gateway IP being picked up when submitting from the far side of the gateway.
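The working container-to-container submission was along these lines (the image and network names are placeholders for whatever your setup uses):

```shell
# Run spark-submit from a throwaway container attached to the same network,
# so all traffic stays on 172.18.0.0/16 and never crosses the vboxnet0 gateway
docker run --rm --net myproject_default \
    -v "$(pwd)/target:/app" my-spark-image \
    spark-submit --master spark://172.18.0.2:7077 \
    --class myapp.Main /app/scala-2.10/myapp-assembly-1.0.0-SNAPSHOT.jar
```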

0 Answers:

There are no answers.