无法以编程方式将Spark应用程序(使用Cassandra连接器)从远程客户端提交到群集

时间:2014-12-07 04:16:39

标签: java linux amazon-ec2 cassandra apache-spark

我在EC2上运行独立的Spark群集,我正在使用Spark-Cassandra连接器驱动程序编写应用程序,并尝试以编程方式将作业提交到Spark群集。 工作本身很简单:

public static void main(String[] args) {
    SparkConf conf;
    JavaSparkContext sc;
    conf = new SparkConf()
            .set("spark.cassandra.connection.host", host);
    conf.set("spark.driver.host", "[my_public_ip]");
    conf.set("spark.driver.port", "15000");
    sc = new JavaSparkContext("spark://[spark_master_host]","test",conf);
    CassandraJavaRDD<CassandraRow> rdd = javaFunctions(sc).cassandraTable(
            "keyspace", "table");
    System.out.println(rdd.first().toString());
    sc.stop();
}

当我在EC2集群的Spark Master节点中运行时,运行正常。 我试图在远程Windows客户端中运行它。 问题来自这两行:

    conf.set("spark.driver.host", "[my_public_ip]");
    conf.set("spark.driver.port", "15000");

首先,如果我注释掉这两行,应用程序不会抛出异常,但是Executor没有运行,并且有以下日志:

14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/3 is now LOADING

14/12/06 22:40:03 INFO client.AppClient$ClientActor: Executor updated: app-20141207033931-0021/0 is now EXITED (Command exited with code 1)

14/12/06 22:40:03 INFO cluster.SparkDeploySchedulerBackend: Executor app-20141207033931-0021/0 removed: Command exited with code 1

永远不会结束,当我检查工作节点日志时,我发现:

14/12/06 22:40:21 ERROR security.UserGroupInformation: PriviledgedActionException as:[username] cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs    
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52) 
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
... 4 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
        ... 7 more

我不知道那是什么,我的猜测是工作节点可能无法连接到驱动程序,驱动程序可能最初设置为:

14/12/06 22:39:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@[some_host_name]:52660]
14/12/06 22:39:30 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@[some_host_name]:52660]

显然,没有DNS会解析我的主机名......

由于我无法通过"client"脚本将部署模式设置为"cluster"./spark-submit。(我觉得这很荒谬......) 。我尝试在所有Spark Master Worker节点的"XX.XXX.XXX.XX [host-name]"中添加主机解析/etc/hosts

当然没有运气...... 这导致我第二次,不评论两行;

这给了我:

14/12/06 22:59:41 INFO Remoting: Starting remoting
14/12/06 22:59:41 ERROR Remoting: Remoting error: [Startup failed] [
akka.remote.RemoteTransportException: Startup failed
        at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
        at akka.remote.Remoting.start(Remoting.scala:194)
        ...

原因:

Caused by: org.jboss.netty.channel.ChannelException: Failed to bind to: /[my_public_ip]:15000
        at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
        at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:391)
        at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:388)

我仔细检查了我的防火墙设置和路由器设置,确认我的防火墙是否已拨打;和netstat -an确认端口15000未被使用(事实上我试图改为几个可用的端口,没有运气);我ping来自我的集群的其他机器和机器的公共IP,没问题。

现在我完全搞砸了,我只是试着解决这个问题。有什么建议?任何帮助表示赞赏!

1 个答案:

答案 0 :(得分:-2)

请检查15000是否在您的安全组中。