
Date: 2017-09-15 13:37:23

Tags: hadoop apache-spark hdfs rdd

I wrote a Spark application that reads some CSV files (~5-10 GB), transforms the data, and converts it into HFiles. The data is read from HDFS and written back to HDFS.



Here is a snapshot of my Spark UI, where you can see that all other jobs have been processed:




I am writing the HFiles as follows:

    JavaPairRDD<ImmutableBytesWritable, KeyValue> cells = ...

    try {
        Connection c = HBaseKerberos.createHBaseConnectionKerberized("userpricipal", "/etc/security/keytabs/user.keytab");

        Configuration baseConf = c.getConfiguration();
        baseConf.set("hbase.zookeeper.quorum", HBASE_HOST);
        baseConf.set("zookeeper.znode.parent", "/hbase-secure");

        Job job = Job.getInstance(baseConf, "Test Bulk Load");
        HTable table = new HTable(baseConf, "map_data");
        HBaseAdmin admin = new HBaseAdmin(baseConf);

        // Configures the job's output settings for an incremental (bulk) load into the table
        HFileOutputFormat2.configureIncrementalLoad(job, table);
        Configuration conf = job.getConfiguration();

        // Writes the cells as HFiles to outputPath on HDFS
        cells.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class, HFileOutputFormat2.class, conf);
        System.out.println("Finished!!!!!");
    } catch (IOException e) {
        e.printStackTrace();
        System.out.println(e.getMessage());
    }


When I look at the output directory in HDFS, it is still empty! I am using Spark 1.6.3 on the HDP 2.5 platform.


2 answers:

Answer 0 (score: 1)

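It looks like the job never started. Before starting a job, Spark checks the available resources, and I think there are not enough of them. Try reducing the driver and executor memory and the driver and executor cores in your configuration. Here you can read how to calculate appropriate resource values for the executors and the driver: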



Client mode:
Driver runs on a dedicated server (Master node) inside a dedicated process. This means it has all available resources at its disposal to execute work.
Driver opens up a dedicated Netty HTTP server and distributes the specified JAR files to all Worker nodes (big advantage).
Because the Master node has dedicated resources of its own, you don't need to "spend" worker resources on the Driver program.
If the driver process dies, you need an external monitoring system to reset its execution.


Cluster mode:
Driver runs on one of the cluster's Worker nodes. The worker is chosen by the Master leader.
Driver runs as a dedicated, standalone process inside the Worker.
Driver program takes up at least 1 core and a dedicated amount of memory from one of the workers (this can be configured).
Driver program can be monitored from the Master node using the --supervise flag and be reset in case it dies.
When working in Cluster mode, all JARs related to the execution of your application need to be publicly available to all the workers. This means you can either manually place them in a shared location or in a folder on each of the workers.
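
To try the suggestion of requesting fewer resources, the sketch below assembles a spark-submit command line with deliberately small values. The flags (--num-executors, --executor-cores, --executor-memory, --driver-memory) are standard spark-submit options for YARN. The helper class, the main class name com.example.BulkLoadApp, the JAR name bulk-load.jar, and the concrete numbers are assumptions for illustration only, not values from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark): builds a spark-submit command
// with small resource requests so YARN is more likely to schedule the job.
final class SubmitCommandSketch {

    static List<String> build(String mainClass, String jar, int numExecutors,
                              int executorCores, String executorMemory,
                              String driverMemory) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--master");
        cmd.add("yarn");
        cmd.add("--deploy-mode");
        cmd.add("client");
        cmd.add("--num-executors");
        cmd.add(Integer.toString(numExecutors));
        cmd.add("--executor-cores");
        cmd.add(Integer.toString(executorCores));
        cmd.add("--executor-memory");
        cmd.add(executorMemory);
        cmd.add("--driver-memory");
        cmd.add(driverMemory);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(jar); // application JAR goes last, after all options
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder values: start small, then raise them once the job runs.
        List<String> cmd = build("com.example.BulkLoadApp", "bulk-load.jar",
                                 2, 1, "1g", "1g");
        System.out.println(String.join(" ", cmd));
    }
}
```

If the job starts with these small values, the resource requests can be raised step by step until the cluster can no longer satisfy them.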

Answer 1 (score: 0)

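I found out that this problem was caused by Kerberos! When I run the application in yarn-client mode from my Hadoop NameNode, the driver runs on that node, which is also where my Kerberos server runs. Therefore, the keytab file /etc/security/keytabs/user.keytab used for the principal userpricipal is present on this machine.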


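So, to be able to use Spark in a Kerberized Hadoop cluster (even in yarn-cluster mode), you have to copy the keytab file of the user who runs the spark-submit command to the corresponding path on all nodes of the cluster!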

scp /etc/security/keytabs/user.keytab user@workernode:/etc/security/keytabs/user.keytab

After that, you should be able to run kinit -kt /etc/security/keytabs/user.keytab user on every node of the cluster.
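
Since a missing keytab showed up here only as a job that never finished, it may help to fail fast instead. The following is a minimal sketch, assuming the driver code can run a check before it opens the HBase connection. The class KeytabCheck and its method are hypothetical helpers, not part of Spark or HBase, and the check only confirms that the file exists and is readable on the node where it runs, not that the keytab contains valid credentials.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: verifies that the keytab is present on the local node
// before any Kerberos login is attempted.
final class KeytabCheck {

    // Returns true only if the path points to a regular, readable file.
    static boolean isUsableKeytab(String keytabPath) {
        Path p = Paths.get(keytabPath);
        return Files.isRegularFile(p) && Files.isReadable(p);
    }

    public static void main(String[] args) {
        // Default path taken from the question; pass another path as argument.
        String path = args.length > 0 ? args[0] : "/etc/security/keytabs/user.keytab";
        if (!isUsableKeytab(path)) {
            System.err.println("Keytab missing or unreadable on this node: " + path);
            System.exit(1);
        }
        System.out.println("Keytab is readable: " + path);
    }
}
```

Calling isUsableKeytab at the start of the driver and throwing an exception when it returns false would turn the silent hang into an explicit error on whichever node the driver lands.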