I just stood up a Hadoop/Spark cluster to kick off a data science initiative at my company. I'm using Ambari as the manager and installed the Hortonworks distribution (HDFS 2.7.3, Hive 1.2.1, Spark 2.1.1, and the other required services). By the way, I'm running RHEL 7. I have 2 name nodes, 10 data nodes, 1 Hive node, and 1 management node (Ambari).
I built a list of firewall ports based on the Apache and Ambari documentation and had my infrastructure folks push those rules out. I then ran into the problem of Spark wanting to pick random ports. When I tried to run a Spark job (the traditional Pi example), it failed because I hadn't opened the whole ephemeral port range. Since we will likely be running multiple jobs, it makes sense to let Spark handle this and choose from the ephemeral port range (1024 - 65535) rather than pinning individual ports. I know I could pick a narrower range, but for convenience I just asked my guys to open the entire ephemeral range. At first my infrastructure people balked at that, but once I explained the purpose they went ahead and did it.
Based on that, I figured my problem was fixed, but when I tried to run a job it still failed:
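For reference, this is roughly what was applied on each node to open that range under firewalld on RHEL 7 (a minimal sketch; the "public" zone is an assumption, substitute whatever zone your cluster interfaces are actually bound to):

# Open the full ephemeral TCP range permanently, then reload the rules
firewall-cmd --permanent --zone=public --add-port=1024-65535/tcp
firewall-cmd --reload
# Confirm the range now shows up
firewall-cmd --zone=public --list-ports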
Log Type: stderr
Log Upload Time: Thu Oct 12 11:31:01 -0700 2017
Log Length: 14617
Showing 4096 bytes of 14617 total.
Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:52 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:53 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:54 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:55 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:56 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:57 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:28:59 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:00 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:01 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:02 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:03 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:04 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:05 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:06 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:07 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:09 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:10 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:11 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:12 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:13 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:14 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Failed to connect to driver at 10.10.98.191:33937, retrying ...
17/10/12 11:29:15 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Failed to connect to driver!
at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:607)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:461)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:283)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:783)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:781)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:804)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
17/10/12 11:29:15 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver!)
17/10/12 11:29:15 INFO ShutdownHookManager: Shutdown hook called
At first I thought maybe I had some misconfiguration between Spark and the namenodes/datanodes. However, to test that, I simply stopped firewalld on every node and tried the job again, and it ran just fine.
So, my question - I opened the entire 1024 - 65535 port range, and I can see the Spark driver trying to connect on those high ports (as shown above, in the 30k - 40k range). Yet for some reason it fails when the firewall is up and works when it is down. I checked the firewall rules and, sure enough, the ports are open - and those rules are in effect, because I can reach the web UIs for Ambari, YARN, and HDFS, which are specified in the same firewalld XML rules file....
I'm new to Hadoop/Spark, so I'm wondering: is there something I'm missing? Are there some lower ports below 1024 that I need to account for? Here is the list of ports below 1024 that I have open, in addition to the 1024 - 65535 range:
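That test was nothing fancier than the following on each node (re-enabling the firewall afterwards, of course):

# Temporarily disable the firewall on a node to rule it in or out
systemctl stop firewalld
# ... rerun the SparkPi job ...
systemctl start firewalld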
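One way to double-check that an arbitrary high port actually gets through, independent of Spark (a sketch, assuming nc/ncat is available, which it usually is on RHEL 7 via nmap-ncat): listen on a high port on the driver host and probe it from a data node.

# On the driver host (e.g. 10.10.98.191): listen on an arbitrary high port
nc -l 33937
# From a data node: probe that port; a "Connected to ..." message means the firewall passes it
nc -v 10.10.98.191 33937
# And confirm firewalld on the driver host reports the range as open
firewall-cmd --list-ports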
88
111
443
1004
1006
1019
It's quite possible I'm missing a low-numbered port that I actually need and just don't know it. Beyond that, everything else should be handled by the 1024 - 65535 port range.
Answer 0: (score: 0)
OK, so working with some folks in the Hortonworks community, I was able to come up with a solution. Basically, you need at least one port defined, but you can extend it by specifying spark.port.maxRetries=xxxxx. Combining that setting with spark.driver.port=xxxxx gives you a range starting at spark.driver.port and ending at spark.driver.port + spark.port.maxRetries.
If you use Ambari as your manager, the settings go under the "Custom spark2-defaults" section (I assume that on a fully open-source stack install these are just settings under the normal Spark configuration):
I was advised to space these ports apart in blocks of 32; for example, if your driver starts at 40000, you should start spark.blockManager.port at 40033, and so on. See this post:
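For example, something like this (the port values are only illustrative - pick whatever fits your firewall plan):

spark.driver.port         40000
spark.blockManager.port   40033
spark.port.maxRetries     32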
https://community.hortonworks.com/questions/141484/spark-jobs-failing-firewall-issue.html
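The same properties can also be overridden per job on the command line, which is handy for testing before committing the values cluster-wide. A sketch using the Pi example; the examples jar path is the usual HDP location and may differ on your install:

# Client mode matches the scenario above (the AM connects back to the driver on the submitting host)
spark-submit --master yarn --deploy-mode client \
  --conf spark.driver.port=40000 \
  --conf spark.blockManager.port=40033 \
  --conf spark.port.maxRetries=32 \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark2-client/examples/jars/spark-examples*.jar 100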