Question

I have written about 300 directories containing HFiles to my HDFS (using a Spark application). They are stored in HDFS like this:

/user/testuser/data/hfiles_iteration_0/...
/user/testuser/data/hfiles_iteration_1/...
/user/testuser/data/hfiles_iteration_2/...

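These files were created from ~15,000 CSV files with a total size of about 100 GB. I split the CSV files into bundles and processed them iteratively in my Spark application. Now I need to process these 300 HFile directories in order to add the data to my HBase table.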

I created my HBase table via:

create 'my_table', 
  {NAME => 'k', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}, 
  {NAME => 'b', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}, 
  {NAME => 't', DATA_BLOCK_ENCODING => 'FAST_DIFF', COMPRESSION => 'SNAPPY'}, 
  {SPLITS => [ '00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19']}

The data in the HFiles should already be partitioned by the 20 HBase region IDs 0 to 19.
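As a side note on the pre-split above: 20 split points produce 21 regions, because row keys that sort before '00' get a region of their own. The following is a minimal sketch of how a row key maps onto those regions. It assumes ASCII row keys (so Java string order matches HBase's byte order), and the class and method names are hypothetical, not taken from the application.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the original application.
class SplitLookup {

    // Rebuilds the 20 split points '00' .. '19' used in the create statement.
    static String[] splitKeys() {
        String[] splits = new String[20];
        for (int i = 0; i < splits.length; i++) {
            splits[i] = String.format("%02d", i);
        }
        return splits;
    }

    // Returns the index of the region a row key falls into:
    // 0 = keys before the first split point, i = keys >= splits[i - 1].
    // Assumes ASCII keys so that String ordering equals byte ordering.
    static int regionIndex(String[] splits, String rowKey) {
        int pos = Arrays.binarySearch(splits, rowKey);
        // Exact match: the key is the start key of region pos + 1.
        // No match: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        String[] splits = splitKeys();
        System.out.println("regions: " + (splits.length + 1));
        System.out.println("00abc -> region " + regionIndex(splits, "00abc"));
        System.out.println("19zzz -> region " + regionIndex(splits, "19zzz"));
    }
}
```

With this layout, keys prefixed '00' to '19' land in regions 1 to 20, and region 0 stays empty unless some keys sort before '00'.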

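Because I have to process the HFile directories one after another, I wrote a Java application that also works iteratively and automatically runs the following command: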

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dcreate.table=no <pathToReadFrom> my_table

//=> with <pathToReadFrom> = /user/testuser/data/hfiles_iteration_[0-299]

Here is the whole logic as a code snippet:

TreeSet<String> subDirs = getHFileDirectories(new Path(HDFS_PATH), hadoopConf);

for(String hFileDir : subDirs) {

    try {
        String pathToReadFrom = HDFS_OUTPUT_PATH + "/" + hFileDir;
        String[] execCode = {"hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles", "-Dcreate.table=no", pathToReadFrom, hbaseTableName};
        ProcessBuilder pb = new ProcessBuilder(execCode);
        pb.redirectErrorStream(true);
        final Process p = pb.start();

        // Write the output of the Process to the console
        new Thread(new Runnable() {
            public void run() {
                BufferedReader input = new BufferedReader(new InputStreamReader(p.getInputStream()));
                String line = null; 

                try {
                    while ((line = input.readLine()) != null)
                        System.out.println(line);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }).start();

        // Wait for the end of the execution
        p.waitFor();
    ...
}
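The snippet above neither checks the exit code of each call nor waits for the reader thread to finish. The following is a minimal sketch of a variant that reads the merged output in the calling thread, closes the stream, and returns the exit code. The class `BulkLoadRunner` and its method names are hypothetical, and nothing here is known to change the failure described below; it only makes each call's result visible.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the original application.
class BulkLoadRunner {

    // Builds the same command line the question runs for one HFile directory.
    static List<String> buildCommand(String pathToReadFrom, String tableName) {
        return Arrays.asList(
                "hbase",
                "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
                "-Dcreate.table=no",
                pathToReadFrom,
                tableName);
    }

    // Runs the command, echoes its merged stdout/stderr, and returns the exit code.
    static int runAndWait(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // stderr is merged into stdout, so one reader suffices
        Process p = pb.start();
        try (BufferedReader input = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = input.readLine()) != null) {
                System.out.println(line);
            }
            return p.waitFor();
        } finally {
            p.destroy(); // no effect if the process has already exited
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: BulkLoadRunner <pathToReadFrom> <tableName>");
            return;
        }
        int exitCode = runAndWait(buildCommand(args[0], args[1]));
        if (exitCode != 0) {
            System.err.println("Bulk load of " + args[0] + " exited with code " + exitCode);
        }
    }
}
```

Stopping the loop at the first non-zero exit code would at least pin down the directory at which the failures begin.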

The application works for the first 200 of the ~300 HFile directories. After that, every further call to LoadIncrementalHFiles fails with an exception.

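Can anyone help? What exactly am I doing wrong? Could this be a memory problem? Does each call need some "cool-down" time before the next iteration starts? I cannot see a problem in the code, since > 60% of the iterations work fine!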

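Finally, here is a screenshot of my host's memory usage. When I start the application, the Cache value begins to increase (in the screenshot it is still running):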

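Another observation: once one LoadIncrementalHFiles call has failed, all further iterations fail as well. Restarting the job does not help. Restarting the Hadoop services or the complete node does not help either. I have to drop and recreate the HBase table via the hbase shell; only then can I run the application again (until it reaches the same iteration count at which the problem occurred before)!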

Further tests: below is the output of the lsof | wc -l command, which shows the open files for my user. The count rises very sharply during execution:

// ***************************** Before execution *****************************

[user@hadoop ~]$ lsof | wc -l
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/42/gvfs
      Output information may be incomplete.
14160

// ***************************** While execution ******************************

[user@hadoop ~]$ lsof | wc -l
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/42/gvfs
      Output information may be incomplete.
99879

// ***************************** After execution ******************************

[user@hadoop ~]$ lsof | wc -l
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/42/gvfs
      Output information may be incomplete.
14552
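The `lsof | wc -l` totals above cover every process visible to the user, so they do not show which process holds the additional ~85,000 entries (99879 - 14160 = 85719). Note also that lsof lines include entries such as memory-mapped files, so they are not the same thing as open file descriptors. The following is a minimal sketch for counting the descriptors of a single process on Linux, assuming `/proc` is mounted and readable for that PID. The class name `OpenFdCounter` is hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; Linux-specific because it relies on /proc.
class OpenFdCounter {

    // Returns the number of entries in /proc/<pid>/fd,
    // or -1 if the directory does not exist or cannot be read.
    static long countOpenFds(String pid) {
        Path fdDir = Paths.get("/proc", pid, "fd");
        if (!Files.isDirectory(fdDir)) {
            return -1;
        }
        try (Stream<Path> entries = Files.list(fdDir)) {
            return entries.count();
        } catch (IOException | UncheckedIOException e) {
            return -1; // e.g. permission denied for another user's process
        }
    }

    public static void main(String[] args) {
        // "self" refers to the current JVM; pass a PID to inspect another process.
        String pid = args.length > 0 ? args[0] : "self";
        System.out.println(pid + ": " + countOpenFds(pid) + " open file descriptors");
    }
}
```

Sampling this for the region server PID and for the bulk-load client while the loop runs would show which side the growth sits on.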

Thanks!

LoadIncrementalHFiles increases used memory and fails after some time

0 Answers: