Question

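I have a Spark cluster set up with one master and 3 workers. I also have Spark installed on a CentOS VM. I'm trying to run a Spark shell from my local VM that connects to the master and lets me execute simple Scala code. So, here is the command I run on my local VM: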

bin/spark-shell --master spark://spark01:7077

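The shell runs to the point where I can enter Scala code. It says that executors have been granted (x3, one for each worker). If I look at the Master's UI, I can see one running application, Spark shell. All the workers are ALIVE, have 2/2 cores used, and have allocated 512 MB (out of 5 GB) to the application. So, I try to execute the following Scala code: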

sc.parallelize(1 to 100).count

Unfortunately, the command doesn't work. The shell just prints the same warning endlessly:

INFO SparkContext: Starting job: count at <console>:13
INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

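Following my research into the issue, I have confirmed that the master URL I am using is identical to the one shown on the web UI. I can ping and ssh both ways (cluster to local VM and vice versa). I have also played with the executor-memory parameter (both increasing and decreasing the memory), to no avail. Finally, I tried disabling the firewall (iptables) on both sides, but I keep getting the same error. I am using Spark 1.0.2.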

TL;DR Is it possible to run an Apache Spark shell remotely (and, by extension, submit applications remotely)? If so, what am I missing?

EDIT: I took a look at the worker logs and found that the workers had trouble finding Spark:

ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/bin/spark-1.0.2/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
...

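Spark is installed in a different directory on my local VM than on the cluster. The path the worker is trying to find is the one on my local VM. Is there a way for me to specify this path? Or must the paths be identical everywhere?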

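For the moment, I adjusted my directories to get around this error. Now, my Spark shell fails before I get the chance to enter the count command (Master removed our application: FAILED). All the workers show the same error: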

ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@spark02:7078] -> [akka.tcp://sparkExecutor@spark02:53633]:
Error [Association failed with [akka.tcp://sparkExecutor@spark02:53633]] 
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@spark02:53633] 
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: spark02/192.168.64.2:53633

As I suspected, I am running into network issues. What should I look at now?

Answer 1

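I solved this problem for my Spark client and Spark cluster.

Check your network: client A and the cluster must be able to ping each other! Then add two lines of configuration to spark-env.sh on client A.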

First

export SPARK_MASTER_IP=172.100.102.156
export SPARK_JAR=/usr/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar

Second

Test the Spark shell in cluster mode!

Answer 2

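This problem can be caused by the network configuration. It looks like the error TaskSchedulerImpl: Initial job has not accepted any resources can have quite a few causes (see also this answer):

an actual resource shortage
broken communication between the master and the workers
broken communication between the master/workers and the driver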

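The easiest way to exclude the first possibilities is to run a test with a Spark shell running directly on the master. If this works, the communication within the cluster itself is fine, and the problem is caused by the communication with the driver host. To analyze the problem further, it helps to look into the worker logs, which contain entries like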

16/08/14 09:21:52 INFO ExecutorRunner: Launch command: 
    "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" 
    ... 
    "--driver-url" "spark://CoarseGrainedScheduler@192.168.1.228:37752"  
    ...

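and to test whether the worker can establish a connection to the driver's IP/port. Apart from general firewall or port-forwarding issues, the driver may be binding to the wrong network interface. In that case, you can export SPARK_LOCAL_IP on the driver before starting the Spark shell so that it binds to a different interface.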
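The connection test described above can be sketched roughly as follows. This is only an illustration, not part of Spark: the function names driver_endpoint and can_connect are made up for this example, the log path is a placeholder, and the script assumes bash (for /dev/tcp redirection) and the coreutils timeout command are available on the worker.

```shell
#!/usr/bin/env bash
# Sketch: pull the driver host and port out of a worker log "--driver-url"
# entry, then check from the worker whether that TCP port accepts connections.
# Both function names are hypothetical helpers, not Spark tooling.

# Read log text on stdin and print "host port" for the first --driver-url found.
driver_endpoint() {
  sed -n 's/.*"--driver-url" *"[^"@]*@\([^":/]*\):\([0-9][0-9]*\).*/\1 \2/p' | head -n 1
}

# Return 0 if a TCP connection to host ($1) and port ($2) succeeds within 3 seconds.
# Assumes bash's /dev/tcp redirection and the coreutils `timeout` command.
can_connect() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example, run on a worker (the log path is a placeholder):
#   read -r host port < <(driver_endpoint < /path/to/worker.log)
#   can_connect "$host" "$port" && echo "driver reachable" || echo "driver NOT reachable"
#
# If the driver turns out to be bound to the wrong interface, start the shell
# on the driver with SPARK_LOCAL_IP set (the address here is a placeholder):
#   SPARK_LOCAL_IP=192.168.1.228 bin/spark-shell --master spark://spark01:7077
```

The driver port typically changes every time the shell is started, so the check only makes sense while the shell that produced the log entry is still running.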

Some additional references:

Knowledge base entry on network connectivity issues.
GitHub discussion about improving the Initial job has not accepted any resources message.

How can I run the Apache Spark shell remotely?
