I have the following directory structure:
Dir1
|___Dir2
    |___Dir3
        |___Dir4
            |___File1.gz
            |___File2.gz
            |___File3.gz
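The subdirectories are just nested and do not contain any files.
I am trying to recurse through the directories on HDFS with the code below. If a path is a directory, I append /* to it and call addInputPath.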
args[0] = "path/to/Dir1"; // given at command line
FileStatus fs = new FileStatus();
Path q = new Path(args[0]);
FileInputFormat.addInputPath(job,q);
Path p = new Path(q.toString()+"/*");
fs.setPath(p);
while(fs.isDirectory())
{
fs.setPath(new Path(p.toString()+"/*"));
FileInputFormat.addInputPath(job,fs.getPath());
}
However, the code never seems to enter the while loop, and I get a "not a File" exception.
Answer 0 (score: 4)
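Where is the if statement you are referring to?
In any case, you can take a look at these utility methods, which add all the files under a directory to the job's input: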
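Utils:
import java.io.IOException;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Utils {
    // Returns every file below basePath, plus any empty subdirectory.
    // basePath is expected to be absolute, e.g. "/path/to/Dir1".
    public static Path[] getRecursivePaths(FileSystem fs, String basePath)
            throws IOException, URISyntaxException {
        List<Path> result = new ArrayList<Path>();
        // Prefix the path with the file system URI.
        basePath = fs.getUri() + basePath;
        FileStatus[] listStatus = fs.globStatus(new Path(basePath + "/*"));
        for (FileStatus fstat : listStatus) {
            readSubDirectory(fstat, basePath, fs, result);
        }
        return result.toArray(new Path[result.size()]);
    }

    // Adds fileStatus to paths if it is a file; otherwise descends into it.
    private static void readSubDirectory(FileStatus fileStatus, String basePath,
            FileSystem fs, List<Path> paths) throws IOException, URISyntaxException {
        if (!fileStatus.isDirectory()) {
            paths.add(fileStatus.getPath());
        } else {
            String subPath = fileStatus.getPath().toString();
            FileStatus[] listStatus = fs.globStatus(new Path(subPath + "/*"));
            if (listStatus.length == 0) {
                // Keep empty directories as input paths too.
                paths.add(fileStatus.getPath());
            }
            for (FileStatus fst : listStatus) {
                readSubDirectory(fst, subPath, fs, paths);
            }
        }
    }
}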
Use it in your job runner class:
...
Path[] inputPaths = Utils.getRecursivePaths(fs, inputPath);
FileInputFormat.setInputPaths(job, inputPaths);
...
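The recursion in the utility methods above can be tried out without a cluster. What follows is only a rough sketch under that assumption: it uses java.nio.file on the local file system as a stand-in for the Hadoop FileSystem API, and the class name LocalRecursivePaths is hypothetical. Like the utility methods, it collects every file it finds and also keeps empty directories.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical local-filesystem analogue of the recursive listing;
// java.nio.file stands in for the Hadoop FileSystem API here.
class LocalRecursivePaths {

    // Returns every non-directory entry below base, plus any empty directory.
    static List<Path> getRecursivePaths(Path base) throws IOException {
        List<Path> result = new ArrayList<>();
        collect(base, result);
        return result;
    }

    private static void collect(Path current, List<Path> out) throws IOException {
        if (!Files.isDirectory(current)) {
            // A plain file: this is what would become a job input path.
            out.add(current);
            return;
        }
        boolean empty = true;
        // List the direct children and descend into each of them.
        try (DirectoryStream<Path> children = Files.newDirectoryStream(current)) {
            for (Path child : children) {
                empty = false;
                collect(child, out);
            }
        }
        if (empty) {
            // Keep empty directories, as the utility methods above do.
            out.add(current);
        }
    }
}
```

For a local copy of the layout in the question, LocalRecursivePaths.getRecursivePaths(Paths.get("path/to/Dir1")) should return the three .gz files under Dir4, in no guaranteed order.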