Spark doing no work on slave: Initial job has not accepted any resources

Question

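I am trying to run a very simple Spark setup over SSH tunnels, and I cannot get it to work.

The master runs on my PC, started with ./sbin/start-master.sh -h localhost -p 7077 (unless stated otherwise, everything else uses the defaults).

On my slave PC (IP 192.168.0.222), which is in a different domain and where I have no root access, I open a tunnel with ssh -N -L localhost:7078:localhost:7077 myMasterPCSSHalias and start the slave with ./sbin/start-slave.sh spark://localhost:7078. I can now see this slave on the dashboard at http://localhost:8080/ in my browser, and it shows 14 GB of free memory.

When I try to run, for example, this: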

./bin/spark-submit --master spark://localhost:7077 examples/src/main/python/pi.py 10

it hangs with the following message until I kill it (the full log is below):

WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

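I am sure I am not requesting more resources than are available. The problem persists even with --executor-memory 512m, and the executor just reports the RUNNING state. The only content of the error log is: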

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:45:44 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/05/09 22:45:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:45:45 INFO SecurityManager: Changing view acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: Changing modify acls to: hnykdan1,dan
16/05/09 22:45:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1, dan); users with modify permissions: Set(hnykdan1, dan)

The slave log contains this:

16/05/09 22:48:56 INFO Worker: Asked to launch executor app-20160509224034-0013/0 for PythonPi
16/05/09 22:48:56 INFO SecurityManager: Changing view acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: Changing modify acls to: hnykdan1
16/05/09 22:48:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hnykdan1); users with modify permissions: Set(hnykdan1)
16/05/09 22:48:56 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp" "/home/hnykdan1/spark/conf/:/home/hnykdan1/spark/lib/spark-assembly-1.6.1-hadoop2.6.0.jar:/home/hnykdan1/spark/lib/datanucleus-core-3.2.10.jar:/home/hnykdan1/spark/lib/datanucleus-api-jdo-3.2.6.jar:/home/hnykdan1/spark/lib/datanucleus-rdbms-3.2.9.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.driver.port=37450" "-XX:MaxPermSize=256m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.0.222:37450" "--executor-id" "0" "--hostname" "147.32.8.103" "--cores" "8" "--app-id" "app-20160509224034-0013" "--worker-url" "spark://Worker@147.32.8.103:54894"

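Everything looks normal to me, and I have no idea where the problem could be. Do I also need a tunnel in the opposite direction? When I run the slave locally in exactly the same way, it works fine. Thanks.

Full log from console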

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/09 22:28:21 INFO SparkContext: Running Spark version 1.6.1
16/05/09 22:28:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/09 22:28:22 INFO SecurityManager: Changing view acls to: dan
16/05/09 22:28:22 INFO SecurityManager: Changing modify acls to: dan
16/05/09 22:28:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(dan); users with modify permissions: Set(dan)
16/05/09 22:28:22 INFO Utils: Successfully started service 'sparkDriver' on port 34508.
16/05/09 22:28:23 INFO Slf4jLogger: Slf4jLogger started
16/05/09 22:28:23 INFO Remoting: Starting remoting
16/05/09 22:28:23 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.0.222:44359]
16/05/09 22:28:23 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 44359.
16/05/09 22:28:23 INFO SparkEnv: Registering MapOutputTracker
16/05/09 22:28:23 INFO SparkEnv: Registering BlockManagerMaster
16/05/09 22:28:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-db4c3293-423f-4966-a479-b69a90439da9
16/05/09 22:28:23 INFO MemoryStore: MemoryStore started with capacity 511.1 MB
16/05/09 22:28:23 INFO SparkEnv: Registering OutputCommitCoordinator
16/05/09 22:28:24 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/05/09 22:28:24 INFO SparkUI: Started SparkUI at http://192.168.0.222:4040
16/05/09 22:28:24 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/httpd-aa031b8c-f605-41c3-aabe-fc4fe01bdcf8
16/05/09 22:28:24 INFO HttpServer: Starting HTTP Server
16/05/09 22:28:24 INFO Utils: Successfully started service 'HTTP file server' on port 41770.
16/05/09 22:28:24 INFO Utils: Copying /home/hnykdan1/spark/examples/src/main/python/pi.py to /tmp/spark-d532a9c1-0455-4937-ad27-b47abb2a65e8/userFiles-14720bed-cd41-4b15-9bd3-38dbf4f268ff/pi.py
16/05/09 22:28:24 INFO SparkContext: Added file file:/home/hnykdan1/spark/examples/src/main/python/pi.py at http://192.168.0.222:41770/files/pi.py with timestamp 1462825704629
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Connecting to master spark://localhost:7077...
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160509222824-0011
16/05/09 22:28:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44617.
16/05/09 22:28:24 INFO NettyBlockTransferService: Server created on 44617
16/05/09 22:28:24 INFO AppClient$ClientEndpoint: Executor added: app-20160509222824-0011/0 on worker-20160509214654-147.32.8.103-54894 (147.32.8.103:54894) with 8 cores
16/05/09 22:28:24 INFO BlockManagerMaster: Trying to register BlockManager
16/05/09 22:28:24 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160509222824-0011/0 on hostPort 147.32.8.103:54894 with 8 cores, 1024.0 MB RAM
16/05/09 22:28:24 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.222:44617 with 511.1 MB RAM, BlockManagerId(driver, 192.168.0.222, 44617)
16/05/09 22:28:24 INFO BlockManagerMaster: Registered BlockManager
16/05/09 22:28:25 INFO AppClient$ClientEndpoint: Executor updated: app-20160509222824-0011/0 is now RUNNING
16/05/09 22:28:25 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/05/09 22:28:25 INFO SparkContext: Starting job: reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39
16/05/09 22:28:25 INFO DAGScheduler: Got job 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39) with 10 output partitions
16/05/09 22:28:25 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:25 INFO DAGScheduler: Parents of final stage: List()
16/05/09 22:28:25 INFO DAGScheduler: Missing parents: List()
16/05/09 22:28:25 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39), which has no missing parents
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.0 KB, free 4.0 KB)
16/05/09 22:28:26 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 6.7 KB)
16/05/09 22:28:26 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.0.222:44617 (size: 2.7 KB, free: 511.1 MB)
16/05/09 22:28:26 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/05/09 22:28:26 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /home/hnykdan1/spark/examples/src/main/python/pi.py:39)
16/05/09 22:28:26 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
16/05/09 22:28:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:28:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:41 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:29:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:11 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/05/09 22:30:26 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Answer 1

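Since you have checked that you have the resources, the next most likely problem is that the executors cannot connect back to the driver. When a job is submitted, the driver starts a server that the executors connect to in order to download jars.

Yes, the error message (Initial job has not accepted any resources...) gives no hint of a network problem. This is a known issue; see, for example: https://github.com/databricks/spark-knowledgebase/issues/9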
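
One way to test the "executors cannot connect back to the driver" theory is to try opening a TCP connection from the slave machine to the address in the --driver-url argument of the worker's launch command. The following is only a minimal sketch in plain Python: the helper names parse_driver_url and can_connect are made up for illustration, and the URL in the final comment is the one from the slave log in the question.

```python
import re
import socket


def parse_driver_url(url):
    """Split a 'spark://Name@host:port' URL into a (host, port) tuple."""
    match = re.match(r"^spark://(?:[^@/]+@)?([^:/]+):(\d+)$", url)
    if match is None:
        raise ValueError("not a spark:// URL: %r" % url)
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


# Example, run on the slave machine while the job is hanging:
#   host, port = parse_driver_url(
#       "spark://CoarseGrainedScheduler@192.168.0.222:37450")
#   print("driver reachable:", can_connect(host, port))
```

If this reports that the driver is not reachable from the slave, the executor cannot register with the driver either, which would be consistent with the job waiting for resources indefinitely.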

Answer 2

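It may be network related (security group rules). This was a crude test, but I got it working just by opening the master and worker servers to all TCP traffic (inbound and outbound).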
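
Opening all TCP traffic confirms the diagnosis but is broader than necessary. A narrower option, sketched here as an assumption rather than a tested fix, is to pin the ports the driver listens on so that only those need to be allowed or tunnelled. The property names below are the ones listed in the Spark 1.6 configuration page (some were removed in later versions, so check the page for your version); the helper name pinned_port_args and the base port 40000 are hypothetical.

```python
def pinned_port_args(driver_host, base_port):
    """Build spark-submit --conf arguments that fix the driver-side ports.

    Without these settings Spark picks random ports on every run, which
    makes firewall rules or SSH tunnels impossible to write in advance.
    """
    settings = [
        # Address the executors use to reach the driver.
        ("spark.driver.host", driver_host),
        ("spark.driver.port", base_port),
        # HTTP file server and broadcast server started by the driver.
        ("spark.fileserver.port", base_port + 1),
        ("spark.broadcast.port", base_port + 2),
        # Block manager port (applies to the driver and the executors).
        ("spark.blockManager.port", base_port + 3),
    ]
    args = []
    for key, value in settings:
        args += ["--conf", "%s=%s" % (key, value)]
    return args


# Example: print the extra arguments for the spark-submit call in the question.
#   print(" ".join(pinned_port_args("192.168.0.222", 40000)))
```

The resulting arguments would be placed between ./bin/spark-submit and the script path; the chosen ports then have to be reachable from the worker machine.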
