Spark executors fail to connect to a mysterious port 35529

Date: 2017-02-27 13:54:53

Tags: apache-spark hdfs

I have a Spark cluster (7 workers × 2 cores) running Spark 2.0.2, set up alongside an HDFS cluster.

When I read some HDFS files from Jupyter, I see the application start with 14 cores and 3 executors, but none of the workers manage to start any task, because they fail to connect to a strange "localhost" address on port 35529.

from pyspark.sql import SparkSession

# master and appName are defined earlier in the notebook
spark       = SparkSession.builder.master(master).appName(appName).config("spark.executor.instances", 3).getOrCreate()
sc          = spark.sparkContext
hdfs_master = "hdfs://xx.xx.xx.xx:8020"
hdfs_path   = "/logs/cycliste_debug/2017/2017_02/2017_02_20/23h/*"
infos       = sc.textFile(hdfs_master + hdfs_path)

This is what I get (screenshot of the application UI):

(It also strikes me as odd to see 14 cores allocated when only 3 × 2 = 6 should be possible, i.e. spark.executor.instances × the number of cores per node.)

Here is the cluster summary:

Executor summary for app-20170227140938-0009:

ExecutorID  Worker  Cores   Memory  State ▾ Logs
1488    worker-20170227125912-xx.xx.xx.xx-38028 2   1024    RUNNING stdout stderr
1489    worker-20170227125954-xx.xx.xx.xx-48962 2   1024    RUNNING stdout stderr
5       worker-20170227125959-xx.xx.xx.xx-48149 2   1024    RUNNING stdout stderr
1486    worker-20170227130012-xx.xx.xx.xx-47639 2   1024    RUNNING stdout stderr
1490    worker-20170227130027-xx.xx.xx.xx-44921 2   1024    RUNNING stdout stderr
1485    worker-20170227130152-xx.xx.xx.xx-50620 2   1024    RUNNING stdout stderr
1487    worker-20170227130248-xx.xx.xx.xx-42100 2   1024    RUNNING stdout stderr

And here is an example of the error from one of the workers:

stderr log page for app-20170227140938-0009/1488

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/27 14:37:57 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 5864@vpsxxx.ovh.net
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for TERM
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for HUP
17/02/27 14:37:57 INFO SignalUtils: Registered signal handler for INT
17/02/27 14:37:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/27 14:37:58 INFO SecurityManager: Changing view acls to: spark
17/02/27 14:37:58 INFO SecurityManager: Changing modify acls to: spark
17/02/27 14:37:58 INFO SecurityManager: Changing view acls groups to: 
17/02/27 14:37:58 INFO SecurityManager: Changing modify acls groups to: 
17/02/27 14:37:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
17/02/27 14:38:01 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:70)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:174)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:270)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:71)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to localhost/127.0.0.1:35529
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:35529
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    ... 1 more

I understand that this is simply a communication problem between two processes.
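The failing call is just a TCP connect from the executor back to the driver's RPC endpoint, so it can be reproduced outside Spark with a plain socket. A minimal sketch, using the host and port taken from the stack trace:

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Attempt a plain TCP connection, as the executor's Netty client does."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False

# The address the executor was told to dial, per the stack trace above.
print(can_connect("127.0.0.1", 35529))
```

Running this on a worker node while the driver is up should print False here, matching the "Connection refused" in the log: the driver is listening on a different machine (or interface) than the address it advertised.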

So here is my /etc/hosts:

127.0.0.1   localhost
193.xx.xx.xxx   vpsxxxx.ovh.net vpsxxxx
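One thing worth verifying, independently of Spark, is what the machine's own hostname resolves to: if it maps to a loopback address, a daemon that identifies itself by hostname will advertise localhost to remote peers. A minimal check:

```python
import socket

hostname = socket.gethostname()
addr = socket.gethostbyname(hostname)

# If this prints 127.0.0.1 (or 127.0.1.1), remote executors can be handed
# a loopback address they have no way to reach.
print(hostname, "->", addr)
```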

Any ideas?

1 Answer:

Answer 0 (score: 0)

Check that SPARK_LOCAL_IP is set to the correct IP on each slave node.
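For a standalone cluster this is typically done in conf/spark-env.sh on the driver and on every worker. A sketch, where the IP is a placeholder for each machine's own routable address (e.g. the one listed next to its FQDN in /etc/hosts, not 127.0.0.1):

```shell
# conf/spark-env.sh — set on the driver and on every worker.
# Replace with this machine's own routable address.
export SPARK_LOCAL_IP=193.xx.xx.xxx
```

If the driver runs inside Jupyter, the equivalent driver-side setting can also be passed as a config when building the session (e.g. spark.driver.host), so that the address printed in the executor error is reachable from the workers.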