Question

I created an external Hive table with a specified schema but no data; call it table A. Now suppose I have CSV files in an HDFS directory, organized like this:

20150718/dir1/dir2/file1.csv
20150718/dir1/dir2/file2.csv
...................
20150718/dir1/dir2/..../dirN/file10000.csv

In other words, the files can sit at many different directory levels under 20150718. How can I load all of these CSV files with a single Hive / shell command?

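One more note: I plan to keep creating partitions by date going forward, so what should I do then? I'm still a new Hive user, so any suggestions are appreciated.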

Answer 1

//Get the configuration

Configuration conf = getConf();
FileSystem fs = inputPath.getFileSystem(conf);

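//Specify a filter; in your case, filter on the date.
//(FileFilter here is presumably your own PathFilter implementation, not a Hadoop class.)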

PathFilter pf = new FileFilter(conf, fs, new String[] { "txt" });

//Move or copy recursively

moveRecursivelytoTarget(target, fs, inputPath, pf);

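protected void moveRecursivelytoTarget(String target, FileSystem fs, Path path, PathFilter inputFilter)
    throws IOException
  {
    for (FileStatus stat : fs.listStatus(path, inputFilter))
      if (stat.isDirectory())
        //Descend into subdirectories; the filter must accept directories for this to happen.
        moveRecursivelytoTarget(target, fs, stat.getPath(), inputFilter);
      else
      {
        //The files are already on HDFS, so copy within the file system
        //(org.apache.hadoop.fs.FileUtil) rather than with copyFromLocalFile.
        //Note: files with the same name in different directories end up at the same dest.
        Path dest = new Path(target, stat.getPath().getName());
        FileUtil.copy(fs, stat.getPath(), fs, dest, false, fs.getConf());
        //Or move instead of copying:
        //fs.rename(stat.getPath(), dest);
      }
  }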
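
One thing to watch when flattening a nested tree into a single target directory: two files with the same name in different subdirectories would land on the same target path. A minimal sketch of one way around that, folding the intermediate directory names into the file name, is shown below. FlatNames and flatName are hypothetical names introduced here for illustration (they are not part of Hadoop or of the code above), and the /data prefix is an assumed example location.

```java
// Hypothetical helper for illustration only; not a Hadoop class.
final class FlatNames {
    private FlatNames() {}

    // Turns the part of `file` below `root` into a single file name by
    // replacing each '/' with '_'. Assumes '/'-separated paths and that
    // `file` lies under `root`. The result is only guaranteed to be unique
    // if directory and file names do not themselves contain '_'.
    static String flatName(String root, String file) {
        String prefix = root.endsWith("/") ? root : root + "/";
        if (!file.startsWith(prefix)) {
            throw new IllegalArgumentException(file + " is not under " + root);
        }
        String relative = file.substring(prefix.length());
        return relative.replace('/', '_');
    }

    public static void main(String[] args) {
        // Same base name, different directories: the flattened names differ.
        System.out.println(flatName("/data/20150718", "/data/20150718/dir1/dir2/file1.csv"));
        System.out.println(flatName("/data/20150718", "/data/20150718/dir1/file1.csv"));
    }
}
```

In the copy or rename step, the destination could then be built as something like new Path(target, FlatNames.flatName(root, stat.getPath().toUri().getPath())), where root is the path of the top-level input directory.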

You can also follow the same procedure in the shell.

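To create dynamic partitions, load the files gathered above into a staging table (call it tableA), then read from tableA and write into tableMain with a PARTITION clause. Afterwards you can clean out tableA for that day.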

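set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Hive fills the dynamic partition column from the last column of the SELECT list.
INSERT OVERWRITE TABLE tableMain PARTITION (date) SELECT x,y,z
FROM tableA t;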
