Spark won't run the final `saveAsNewAPIHadoopFile` method in yarn-cluster mode

Date: 2017-09-15 13:37:23

Tags: hadoop apache-spark hdfs rdd

I have written a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from HDFS and saved back to HDFS.
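To make the pipeline concrete, here is a minimal sketch of the kind of CSV-to-KeyValue transformation described above. This is not the author's code; the CSV layout, column family, and qualifier names are invented for illustration:

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CsvToHFileSketch {
    // Reads CSV lines from HDFS and turns them into sorted (rowKey, KeyValue) pairs.
    // Assumes a hypothetical "rowKey,value" line layout, column family "cf", qualifier "val".
    public static JavaPairRDD<ImmutableBytesWritable, KeyValue> transform(JavaSparkContext sc, String inputPath) {
        JavaRDD<String> lines = sc.textFile(inputPath);
        JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = lines.mapToPair(line -> {
            String[] parts = line.split(",");
            byte[] row = Bytes.toBytes(parts[0]);
            KeyValue kv = new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("val"), Bytes.toBytes(parts[1]));
            return new Tuple2<>(new ImmutableBytesWritable(row), kv);
        });
        // HFiles must be written in sorted row-key order; the sort triggers a shuffle,
        // so the HBase types typically need Kryo serialization to be configured.
        return cells.sortByKey(true);
    }
}
```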

When I run the application in yarn-client mode, everything seems to work fine.

But when I try to run it as a yarn-cluster application, the process never seems to perform the final `saveAsNewAPIHadoopFile` action on the RDD that has been transformed and is ready to be saved!

Here is a snapshot of my Spark UI, where you can see that all the other jobs have been processed:

(Spark UI screenshot: jobs view)

And the corresponding stages:

(Spark UI screenshot: stages view)

Here is the last step of my application, where the `saveAsNewAPIHadoopFile` method is called:

```java
JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...

try {
    // Kerberized HBase connection; the keytab path is resolved on the machine running the driver
    Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");
    Configuration baseConf = c.getConfiguration();
    baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
    baseConf.set("zookeeper.znode.parent", "/hbase-secure");

    // configure HFileOutputFormat2 for an incremental (bulk) load into the target table
    Job job = Job.getInstance(baseConf, "Test Bulk Load");
    HTable table = new HTable(baseConf, "map_data");
    HBaseAdmin admin = new HBaseAdmin(baseConf);
    HFileOutputFormat2.configureIncrementalLoad(job, table);
    Configuration conf = job.getConfiguration();

    // write the HFiles to HDFS
    cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);

    System.out.println("Finished!!!!!");
} catch (IOException e) {
    e.printStackTrace();
    System.out.println(e.getMessage());
}
```

I am running the application via `spark-submit`.
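The question does not include the exact submit command, but the two runs would differ only in the deploy mode. A sketch with placeholder class, jar, and path names (none of them from the question):

```bash
# yarn-client mode: the driver runs on the machine where spark-submit is invoked
spark-submit --master yarn --deploy-mode client \
  --class com.example.HFileLoader my-app.jar hdfs:///data/csv hdfs:///data/hfiles

# yarn-cluster mode: the driver runs inside the ApplicationMaster on an arbitrary cluster node
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.HFileLoader my-app.jar hdfs:///data/csv hdfs:///data/hfiles
```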

When I look at the output directory in my HDFS, it is still empty! I am using Spark 1.6.3 on the HDP 2.5 platform.

So I have two questions here: Where does this behavior come from (maybe a memory problem)? And what is the difference between the yarn-client and yarn-cluster modes (I don't understand it yet; the documentation isn't clear to me)? Thanks for your help!

2 answers:

Answer 0 (score: 1)

It seems that the job does not start. Before starting a job, Spark checks the available resources, and I think the available resources are not sufficient. So try reducing the driver and executor memory as well as the driver and executor cores in your configuration. You can read here how to calculate appropriate resource values for the executors and the driver: https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
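A sketch of what reducing those settings at submit time could look like (the values, class name, and jar are illustrative placeholders, not recommendations):

```bash
# illustrative values only; tune them with the Cloudera guide linked above
spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 2g --driver-cores 1 \
  --num-executors 4 --executor-memory 4g --executor-cores 2 \
  --class com.example.HFileLoader my-app.jar
```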

Your job runs in client mode because, in client mode, the driver can use all the resources available on the node. In cluster mode, however, the resources are limited.

Differences between cluster and client mode:
Client:

Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the JAR files specified to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources for the Driver program.
If the driver process dies, you need an external monitoring system to restart its execution.

Cluster:

Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
Driver runs as a dedicated, standalone process inside the Worker.
The Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
The Driver program can be monitored from the Master node using the --supervise flag and be restarted in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers.

Answer 1 (score: 0)

I found out that this problem is related to a Kerberos issue! When running the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, where my Kerberos server is also running. Therefore, the keytab file /etc/security/keytabs/user.keytab used for the userpricipal is present on this machine.

When running the application in yarn-cluster mode, the driver process is started on a random one of my Hadoop nodes. Since I forgot to copy the keytab file to the other nodes after creating it, the driver process of course could not find the keytab file at that local path!

So, to be able to use Spark on a Kerberized Hadoop cluster (even in yarn-cluster mode), you have to copy the keytab file needed by the user who runs the spark-submit command to the corresponding path on all nodes of the cluster:

scp /etc/security/keytabs/user.keytab user@workernode:/etc/security/keytabs/user.keytab

After that, you should be able to obtain a ticket with kinit -kt /etc/security/keytabs/user.keytab user on every node of the cluster.
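A sketch of distributing the keytab and verifying the ticket on every node, assuming a hypothetical workers.txt that lists the cluster's host names:

```bash
# hypothetical: workers.txt lists the host names of all cluster nodes
for host in $(cat workers.txt); do
  scp /etc/security/keytabs/user.keytab "user@${host}:/etc/security/keytabs/user.keytab"
  # verify on each node that a ticket can be obtained from the copied keytab
  ssh "user@${host}" "kinit -kt /etc/security/keytabs/user.keytab user && klist"
done
```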