Question

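I have a directory containing many large compressed text files (where a "large file" is one that does not fit in the heap once decompressed).

I want to apply a reduce operation to each file. The operation needs to process the lines in order and produces a small result of type A.

How can I apply this operation to all files in the directory and get an RDD of type (Path, A)?

In other words, I am looking for something like: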

sc.wholeTextFiles(dir).mapValues(operation)

...but without each file having to be held in memory.

Answer 1

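If they are gzip-compressed, each file gets exactly one partition (gzip is not splittable), so you could use something like the following: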

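sc.textFile(dir).mapPartitions(it => Iterator(it.reduce(operation)))
// mapPartitions gives you an iterator over the lines of each file;
// apply the reduce operation to it and wrap the single result in an
// Iterator, because mapPartitions must itself return an Iterator.
// Note: reduce throws on an empty file (an empty iterator).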
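
Note that sc.textFile does not keep the file path, and reduce can only return a supertype of String rather than an arbitrary type A. One possible way around both, offered as a sketch rather than a definitive solution, is to read each file as a stream and fold over its lines. The helper below is hypothetical (foldGzipLines is not a Spark function) and assumes the files are gzip-compressed UTF-8 text:

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

object GzipFold {
  // Hypothetical helper (not part of Spark): folds `op` over the lines of a
  // gzip-compressed stream, in order, reading one line at a time so the
  // decompressed file is never held in memory as a whole.
  // Assumption: the content is UTF-8 text.
  def foldGzipLines[A](in: InputStream, zero: A)(op: (A, String) => A): A = {
    val reader = new BufferedReader(
      new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8))
    try {
      Iterator
        .continually(reader.readLine()) // returns null at end of stream
        .takeWhile(_ != null)
        .foldLeft(zero)(op)             // sequential, in line order
    } finally {
      reader.close()                    // also closes the underlying stream
    }
  }
}
```

With Spark, this could be combined with sc.binaryFiles, which yields (path, PortableDataStream) pairs and opens each stream lazily, for example: sc.binaryFiles(dir).mapValues(pds => GzipFold.foldGzipLines(pds.open(), zero)(op)). Here zero and op stand for your own serializable initial value and step function, and the key is the path as a String.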

Processing many large files
