Question

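In Java code, I want to connect to a directory in HDFS, find out how many files it contains, get their names, and read them. I can already read the files, but I can't figure out how to count the files in a directory and get the file names the way I would for an ordinary directory.

To read the files I use DFSClient and open each file into an InputStream.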

Answer 1

count

Usage: hadoop fs -count [-q] <paths>

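Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME.

The output columns with -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME.

Example: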

hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1

Exit code:

Returns 0 on success and -1 on error.
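If the output of the count command is captured from Java, each line can be split into the columns described above. The following is only a minimal sketch, assuming the plain (non -q) layout with whitespace-separated columns and the path in the last column; the class name CountLine and its fields are hypothetical and not part of Hadoop.

```java
// Hypothetical helper, not part of Hadoop: holds one parsed line of
// `hadoop fs -count` output (layout without -q is assumed).
final class CountLine {
    final long dirCount;
    final long fileCount;
    final long contentSize;
    final String path;

    private CountLine(long dirCount, long fileCount, long contentSize, String path) {
        this.dirCount = dirCount;
        this.fileCount = fileCount;
        this.contentSize = contentSize;
        this.path = path;
    }

    static CountLine parse(String line) {
        // Split into at most 4 fields so a path containing spaces stays in one piece.
        String[] parts = line.trim().split("\\s+", 4);
        if (parts.length < 4) {
            throw new IllegalArgumentException("Expected 4 columns but got: " + line);
        }
        return new CountLine(
                Long.parseLong(parts[0]),
                Long.parseLong(parts[1]),
                Long.parseLong(parts[2]),
                parts[3]);
    }
}
```

With this, CountLine.parse(line).fileCount would give the FILE_COUNT column for one path.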

You can also use FileSystem and iterate over the files under the path. Here is some sample code:

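int count = 0;
FileSystem fs = FileSystem.get(getConf());
boolean recursive = false; // set to true to descend into subdirectories
// listFiles returns files only, not directories
RemoteIterator<LocatedFileStatus> ri = fs.listFiles(new Path("hdfs://my/path"), recursive);
while (ri.hasNext()) {
    LocatedFileStatus status = ri.next();
    count++;
    System.out.println(status.getPath().getName()); // file name without the parent path
}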

Answer 2

FileSystem fs = FileSystem.get(conf);
Path pt = new Path("/path");
ContentSummary cs = fs.getContentSummary(pt);
long fileCount = cs.getFileCount();

Answer 3

You can also try:

hdfs dfs -ls -R /path/to/your/directory/ | grep -E '^-' | wc -l
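The grep step keeps only the lines whose first character is `-`, which in the ls listing marks a regular file (directory lines start with `d`). If the listing is captured in Java, the same filter can be sketched as below. This assumes the usual 8-column ls layout with the path in the last column; the class name LsListing and the sample lines in the usage note are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hadoop: works on captured lines of
// `hdfs dfs -ls -R` output.
final class LsListing {
    // Same idea as grep -E '^-' | wc -l: count the lines that describe regular files.
    static long countRegularFiles(List<String> lsLines) {
        return lsLines.stream().filter(line -> line.startsWith("-")).count();
    }

    // Returns the path column of every regular-file line (assumes 8 columns).
    static List<String> filePaths(List<String> lsLines) {
        return lsLines.stream()
                .filter(line -> line.startsWith("-"))
                .map(line -> line.trim().split("\\s+", 8))
                .filter(cols -> cols.length == 8)
                .map(cols -> cols[7])
                .collect(Collectors.toList());
    }
}
```

For a listing with one directory line and two file lines, countRegularFiles returns 2 and filePaths returns the two file paths.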

Answer 4

On the command line, you can do it as follows.

hdfs dfs -ls $parentdirectory | awk 'NF >= 8 {system("hdfs dfs -count " $8)}'

Answer 5

hadoop fs -du [-s] [-h] [-x] URI [URI ...]

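Displays the sizes of the files and directories contained in the given directory, or the length of the file if the path is just a file.

Options: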

The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, calculation is done by going 1-level deep from the given path.
The -h option will format file sizes in a “human-readable” fashion (e.g. 64.0m instead of 67108864)
The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.
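
To illustrate what "human-readable" means for the -h option, here is a rough sketch that turns a byte count into the short form from the example above. It is an approximation for illustration, not Hadoop's own formatting code; the class name, the unit letters and the one-decimal rounding are assumptions.

```java
import java.util.Locale;

// Hypothetical helper, not Hadoop's formatter: scales a byte count by
// powers of 1024 and appends a unit letter.
final class HumanReadableSize {
    private static final String[] UNITS = {"", "k", "m", "g", "t", "p", "e"};

    static String format(long bytes) {
        if (bytes < 1024) {
            return Long.toString(bytes); // small values are printed as plain bytes
        }
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.1f%s", value, UNITS[unit]);
    }
}
```

HumanReadableSize.format(67108864L) yields "64.0m", matching the example in the option description.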

Answer 6

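You can check the number of files in a specific directory as follows:

hadoop fs -count /directoryPath/* | awk '{print $2}' | wc -l

count : counts the number of directories, files, and bytes under each path

awk '{print $2}' : prints the second column (FILE_COUNT) of the output

wc -l : counts the output lines, one per path matched by /directoryPath/*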
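
Note that counting lines gives the number of matched paths, while the second column holds the number of files under each of them. If the total number of files is what you need, the FILE_COUNT values can be summed instead. A minimal sketch, assuming the captured lines use the plain count layout (DIR_COUNT, FILE_COUNT, CONTENT_SIZE, path); the class name CountColumnSum is hypothetical.

```java
import java.util.List;

// Hypothetical helper, not part of Hadoop: adds up the FILE_COUNT column
// over captured lines of `hadoop fs -count` output.
final class CountColumnSum {
    static long totalFiles(List<String> countLines) {
        long total = 0;
        for (String line : countLines) {
            String[] cols = line.trim().split("\\s+", 4);
            if (cols.length < 4) {
                continue; // skip blank or malformed lines
            }
            total += Long.parseLong(cols[1]); // second column is FILE_COUNT
        }
        return total;
    }
}
```

For two lines reporting 5 files and 1 file, totalFiles returns 6, whereas a line count would report 2.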

Count of files in an HDFS directory
