Using FileStatus to recurse through a directory

Date: 2013-07-12 15:30:42

Tags: recursion hadoop hdfs cloudera

I have the following directory structure:

Dir1
 |___Dir2 
  |___Dir3
   |___Dir4
     |___File1.gz
     |___File2.gz
     |___File3.gz

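The subdirectories are only nested; they contain no files of their own.

I am trying to recurse through the directories on HDFS with the code below. If a path is a directory, I append /* to it and call addInputPath.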

// args[0] = "path/to/Dir1", given on the command line

FileStatus fs = new FileStatus(); 
Path q = new Path(args[0]); 
FileInputFormat.addInputPath(job,q);

Path p = new Path(q.toString()+"/*");
fs.setPath(p);  

while(fs.isDirectory())
{
    fs.setPath(new Path(p.toString()+"/*"));
    FileInputFormat.addInputPath(job,fs.getPath());
}           

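But the code never seems to enter the while loop, and I get a "not a File" exception.

1 Answer:

Answer 0 (score: 4)

Where is the if statement you are referring to?
In any case, you can take a look at these utility methods, which add all the files under a directory to the job's input: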

Utils:

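import java.io.IOException;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Utils {

  // Collects every file (and every empty directory) found below basePath.
  // basePath is an absolute path without a scheme, e.g. "/path/to/Dir1".
  public static Path[] getRecursivePaths(FileSystem fs, String basePath)
    throws IOException, URISyntaxException {
    List<Path> result = new ArrayList<Path>();
    basePath = fs.getUri() + basePath;
    FileStatus[] listStatus = fs.globStatus(new Path(basePath + "/*"));
    for (FileStatus fstat : listStatus) {
      readSubDirectory(fstat, basePath, fs, result);
    }
    return result.toArray(new Path[result.size()]);
  }

  // Adds a file directly; for a directory, recurses into each of its children.
  private static void readSubDirectory(FileStatus fileStatus, String basePath,
    FileSystem fs, List<Path> paths) throws IOException, URISyntaxException {
    // isDirectory() replaces the deprecated isDir()
    if (!fileStatus.isDirectory()) {
      paths.add(fileStatus.getPath());
    } else {
      String subPath = fileStatus.getPath().toString();
      FileStatus[] listStatus = fs.globStatus(new Path(subPath + "/*"));
      if (listStatus.length == 0) {
        // an empty directory is added as an input path itself
        paths.add(fileStatus.getPath());
      }
      for (FileStatus fst : listStatus) {
        readSubDirectory(fst, subPath, fs, paths);
      }
    }
  }
}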
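
If your Hadoop version provides FileSystem.listFiles(Path, boolean) (the Hadoop 2.x line does), the same traversal can probably be written without hand-rolled recursion. The following is only a sketch under that assumption; the class name RecursiveListing and the method name listFilesRecursively are made up for illustration. Unlike the utility above, it returns files only, so empty directories are skipped.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

// Hypothetical helper, not part of Hadoop.
class RecursiveListing {

    // Returns the paths of all files below base, at any nesting depth.
    static Path[] listFilesRecursively(FileSystem fs, Path base) throws IOException {
        List<Path> result = new ArrayList<>();
        // The second argument asks the file system to descend into subdirectories.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(base, true);
        while (it.hasNext()) {
            result.add(it.next().getPath());
        }
        return result.toArray(new Path[0]);
    }
}
```

The returned array can be handed to FileInputFormat.setInputPaths in the same way as the result of getRecursivePaths.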

Use Utils.getRecursivePaths in your job runner (driver) class:

...
Path[] inputPaths = Utils.getRecursivePaths(fs, inputPath);
FileInputFormat.setInputPaths(job, inputPaths);
...
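
As for why the while loop in the question is never entered: a FileStatus created with new FileStatus() is an empty object that never queries HDFS, so its isDirectory() returns false no matter which path is set on it. The status has to come from the FileSystem. That also seems to explain the "not a File" error, since Dir1 is added as an input path and contains only the directory Dir2. A minimal sketch of the difference follows; the class name FileStatusCheck and the method name isDirectoryOnFs are invented for this example.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical demo class, not part of Hadoop.
class FileStatusCheck {

    // Asks the file system itself whether the path is a directory.
    static boolean isDirectoryOnFs(FileSystem fs, Path path) throws IOException {
        return fs.getFileStatus(path).isDirectory();
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path(args[0]);

        // A blank FileStatus knows nothing about the real file system.
        FileStatus blank = new FileStatus();
        blank.setPath(dir);
        System.out.println("blank FileStatus says directory: " + blank.isDirectory());

        // The status fetched from the file system reflects what is stored there.
        System.out.println("file system says directory: " + isDirectoryOnFs(fs, dir));
    }
}
```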