How to set remoteHost in Spark's RetryingBlockFetcher IOException

Asked: 2015-03-06 18:39:48

Tags: scala apache-spark cluster-computing

I apologize for the length of this post, but I want to understand the problem better.

I have set up my cluster with the master on a different machine from the workers. All the workers are allocated on a single, very capable machine. No firewall is in place between the two machines.

URL: spark://MASTER_IP:7077
Workers: 10
Cores: 10 Total, 0 Used
Memory: 40.0 GB Total, 0.0 B Used
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE

Before the application starts, a worker's log shows (example from one worker):

15/03/06 18:52:19 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/03/06 18:52:19 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:52:19 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:52:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:52:20 INFO Slf4jLogger: Slf4jLogger started
15/03/06 18:52:20 INFO Remoting: Starting remoting
15/03/06 18:52:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240]
15/03/06 18:52:20 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240]
15/03/06 18:52:20 INFO Utils: Successfully started service 'sparkWorker' on port 42240.
15/03/06 18:52:20 INFO Worker: Starting Spark worker WORKER_MACHINE_IP:42240 with 1 cores, 4.0 GB RAM
15/03/06 18:52:20 INFO Worker: Spark home: /home/szymon/spark
15/03/06 18:52:20 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/03/06 18:52:20 INFO WorkerWebUI: Started WorkerWebUI at http://WORKER_MACHINE_IP:8081
15/03/06 18:52:20 INFO Worker: Connecting to master spark://MASTER_IP:7077...
15/03/06 18:52:20 INFO Worker: Successfully registered with master spark://MASTER_IP:7077

I launch my application on the cluster (from the master machine):

./bin/spark-submit --class SimpleApp --master spark://MASTER_IP:7077 --executor-memory 3g --total-executor-cores 10 code/trial_2.11-0.9.jar

Then the workers fetch my application; here is a sample log output from a worker (@WORKER_MACHINE):

15/03/06 18:07:45 INFO ExecutorRunner: Launch command: "/usr/java/jdk1.8.0_31/bin/java" "-cp" "::/home/machine/spark/conf:/home/machine/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-hadoop2.4.0.jar" "-Dspark.driver.port=56753" "-Dlog4j.configuration=file:////home/machine/spark/conf/log4j.properties" "-Dspark.driver.host=MASTER_IP" "-Xms3072M" "-Xmx3072M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@MASTER_IP:56753/user/CoarseGrainedScheduler" "4" "WORKER_MACHINE_IP" "1" "app-20150306181450-0000" "akka.tcp://sparkWorker@WORKER_MACHINE_IP:45288/user/Worker"

The application tries to connect (to MASTER_IP, I believe) at the address localhost/127.0.0.1. How can this be fixed?

15/03/06 18:58:52 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to localhost/127.0.0.1:56545
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191) 

The problem is caused by the createClient method in TransportClientFactory (spark-network-common_2.10-1.2.1-sources.jar), where the String remoteHost is set to localhost:

/**
 * Create a {@link TransportClient} connecting to the given remote host / port.
 *
 * We maintains an array of clients (size determined by spark.shuffle.io.numConnectionsPerPeer)
 * and randomly picks one to use. If no client was previously created in the randomly selected
 * spot, this function creates a new client and places it there.
 *
 * Prior to the creation of a new TransportClient, we will execute all
 * {@link TransportClientBootstrap}s that are registered with this factory.
 *
 * This blocks until a connection is successfully established and fully bootstrapped.
 *
 * Concurrency: This method is safe to call from multiple threads.
 */
public TransportClient createClient(String remoteHost, int remotePort) throws IOException {
  // Get connection from the connection pool first.
  // If it is not found or not active, create a new one.
  final InetSocketAddress address = new InetSocketAddress(remoteHost, remotePort);
  .
  .
  .
  clientPool.clients[clientIndex] = createClient(address);
}
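
For illustration, here is a minimal Scala sketch (my own, not from the Spark sources) of why a remoteHost of "localhost" can never reach the driver machine: resolving "localhost" always yields the loopback interface, so the executor ends up dialing its own machine.

import java.net.InetSocketAddress

// Resolving "localhost" yields the loopback address, so a client created
// with remoteHost = "localhost" connects to the executor's own machine,
// not the driver -- hence "Failed to connect to localhost/127.0.0.1:56545".
object ResolveLocalhost extends App {
  val address = new InetSocketAddress("localhost", 56545)
  println(address.getAddress.getHostAddress) // prints 127.0.0.1
}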

Here is the spark-env.sh file on the worker machine:

export SPARK_HOME=/home/szymon/spark
export SPARK_MASTER_IP=MASTER_IP
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_LOCAL_IP=WORKER_MACHINE_IP
export SPARK_DRIVER_HOST=WORKER_MACHINE_IP
export SPARK_LOCAL_DIRS=/home/szymon/spark/slaveData
export SPARK_WORKER_INSTANCES=10
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=4g
export SPARK_WORKER_DIR=/home/szymon/spark/work

And on the master:

export SPARK_MASTER_IP=MASTER_IP
export SPARK_LOCAL_IP=MASTER_IP
export SPARK_MASTER_WEBUI_PORT=8081
export SPARK_JAVA_OPTS="-Dlog4j.configuration=file:////home/szymon/spark/conf/log4j.properties -Dspark.driver.host=MASTER_IP"
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=10"
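
As a quick diagnostic (my own sketch, assuming a standard SparkContext setup), you can print the host the driver actually advertises to executors; spark.driver.host is a standard Spark property:

import org.apache.spark.{SparkConf, SparkContext}

// Print the address the driver advertises; executors use this value to
// fetch broadcast blocks, so it must be reachable from the worker machine.
object DriverHostCheck extends App {
  val sc = new SparkContext(new SparkConf().setAppName("DriverHostCheck"))
  println(sc.getConf.get("spark.driver.host", "<unset>"))
  sc.stop()
}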

Here is the full log output with more details:

15/03/06 18:58:50 INFO Worker: Asked to launch executor app-20150306190555-0000/0 for Simple Application
15/03/06 18:58:50 INFO ExecutorRunner: Launch command: "/usr/java/jdk1.8.0_31/bin/java" "-cp" "::/home/szymon/spark/conf:/home/szymon/spark/assembly/target/scala-2.10/spark-assembly-1.2.1-hadoop2.4.0.jar" "-Dspark.driver.port=49407" "-Dlog4j.configuration=file:////home/szymon/spark/conf/log4j.properties" "-Dspark.driver.host=MASTER_IP" "-Xms3072M" "-Xmx3072M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@MASTER_IP:49407/user/CoarseGrainedScheduler" "0" "WORKER_MACHINE_IP" "1" "app-20150306190555-0000" "akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240/user/Worker"
15/03/06 18:58:50 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/03/06 18:58:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/06 18:58:51 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:58:51 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:58:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:58:51 INFO Slf4jLogger: Slf4jLogger started
15/03/06 18:58:51 INFO Remoting: Starting remoting
15/03/06 18:58:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@WORKER_MACHINE_IP:52038]
15/03/06 18:58:51 INFO Utils: Successfully started service 'driverPropsFetcher' on port 52038.
15/03/06 18:58:52 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/06 18:58:52 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:58:52 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/03/06 18:58:52 INFO Slf4jLogger: Slf4jLogger started
15/03/06 18:58:52 INFO Remoting: Starting remoting
15/03/06 18:58:52 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/03/06 18:58:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@WORKER_MACHINE_IP:37114]
15/03/06 18:58:52 INFO Utils: Successfully started service 'sparkExecutor' on port 37114.
15/03/06 18:58:52 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkDriver@MASTER_IP:49407/user/CoarseGrainedScheduler
15/03/06 18:58:52 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240/user/Worker
15/03/06 18:58:52 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@WORKER_MACHINE_IP:42240/user/Worker
15/03/06 18:58:52 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
15/03/06 18:58:52 INFO Executor: Starting executor ID 0 on host WORKER_MACHINE_IP
15/03/06 18:58:52 INFO SecurityManager: Changing view acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: Changing modify acls to: szymon
15/03/06 18:58:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(szymon); users with modify permissions: Set(szymon)
15/03/06 18:58:52 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@MASTER_IP:49407/user/MapOutputTracker
15/03/06 18:58:52 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@MASTER_IP:49407/user/BlockManagerMaster
15/03/06 18:58:52 INFO DiskBlockManager: Created local directory at /home/szymon/spark/slaveData/spark-b09c3727-8559-4ab8-ab32-1f5ecf7aeaf2/spark-0c892a4d-c8b9-4144-a259-8077f5316b52/spark-89577a43-fb43-4a12-a305-34b267b01f8a/spark-7ad207c4-9d37-42eb-95e4-7b909b71c687
15/03/06 18:58:52 INFO MemoryStore: MemoryStore started with capacity 1589.8 MB
15/03/06 18:58:52 INFO NettyBlockTransferService: Server created on 51205
15/03/06 18:58:52 INFO BlockManagerMaster: Trying to register BlockManager
15/03/06 18:58:52 INFO BlockManagerMaster: Registered BlockManager
15/03/06 18:58:52 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@MASTER_IP:49407/user/HeartbeatReceiver
15/03/06 18:58:52 INFO CoarseGrainedExecutorBackend: Got assigned task 0
15/03/06 18:58:52 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/06 18:58:52 INFO Executor: Fetching http://MASTER_IP:57850/jars/trial_2.11-0.9.jar with timestamp 1425665154479
15/03/06 18:58:52 INFO Utils: Fetching http://MASTER_IP:57850/jars/trial_2.11-0.9.jar to /home/szymon/spark/slaveData/spark-b09c3727-8559-4ab8-ab32-1f5ecf7aeaf2/spark-0c892a4d-c8b9-4144-a259-8077f5316b52/spark-411cd372-224e-44c1-84ab-b0c3984a6361/fetchFileTemp7857926599487994869.tmp
15/03/06 18:58:52 INFO Utils: Copying /home/szymon/spark/slaveData/spark-b09c3727-8559-4ab8-ab32-1f5ecf7aeaf2/spark-0c892a4d-c8b9-4144-a259-8077f5316b52/spark-411cd372-224e-44c1-84ab-b0c3984a6361/-19284804851425665154479_cache to /home/szymon/spark/work/app-20150306190555-0000/0/./trial_2.11-0.9.jar
15/03/06 18:58:52 INFO Executor: Adding file:/home/szymon/spark/work/app-20150306190555-0000/0/./trial_2.11-0.9.jar to class loader
15/03/06 18:58:52 INFO TorrentBroadcast: Started reading broadcast variable 0
15/03/06 18:58:52 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to localhost/127.0.0.1:56545
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
    at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
    at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89)
    at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:595)
    at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:593)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:593)
    at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:587)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.org$apache$spark$broadcast$TorrentBroadcast$$anonfun$$getRemote$1(TorrentBroadcast.scala:126)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply(TorrentBroadcast.scala:136)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$1.apply(TorrentBroadcast.scala:136)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:56545
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    ... 1 more
15/03/06 18:58:52 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
15/03/06 18:58:57 ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to localhost/127.0.0.1:56545
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:56545
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    ... 1 more
.
.
.
15/03/06 19:00:22 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
15/03/06 19:00:24 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@WORKER_MACHINE_IP:37114] -> [akka.tcp://sparkDriver@MASTER_IP:49407] disassociated! Shutting down.
15/03/06 19:00:24 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@MASTER_IP:49407] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/06 19:00:24 INFO Worker: Asked to kill executor app-20150306190555-0000/0
15/03/06 19:00:24 INFO ExecutorRunner: Runner thread for executor app-20150306190555-0000/0 interrupted
15/03/06 19:00:24 INFO ExecutorRunner: Killing process!
15/03/06 19:00:25 INFO Worker: Executor app-20150306190555-0000/0 finished with state KILLED exitStatus 1
15/03/06 19:00:25 INFO Worker: Cleaning up local directories for application app-20150306190555-0000
15/03/06 19:00:25 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@WORKER_MACHINE_IP:37114] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/03/06 19:00:25 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40WORKER_MACHINE_IP%3A45806-2#1549100100] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.

There is also a warning, which I don't think is related to this problem:

15/03/06 18:07:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

1 Answer:

Answer 0 (score: 0):

You can try setting conf.set("spark.driver.host", ""), where the value is your client host, i.e. the host from which spark-shell or your other script is launched.
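
A minimal Scala sketch of that suggestion (my own illustration; "MASTER_IP" is the placeholder used throughout this post and must be replaced with an address the workers can actually reach):

import org.apache.spark.{SparkConf, SparkContext}

// Explicitly advertise a reachable driver address before creating the
// context, so executors fetch blocks from MASTER_IP instead of localhost.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .set("spark.driver.host", "MASTER_IP") // replace with the driver's real address

val sc = new SparkContext(conf)

The same property can also be passed on the command line via spark-submit's --conf flag (e.g. --conf spark.driver.host=MASTER_IP), which avoids hard-coding it in the application.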