Question

I can have up to 100,000 small files (10-50 KB each). They are all stored in HDFS with a block size of 128 MB. I have to read them all at once with Apache Spark, as below:

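import java.util.List;
import org.apache.spark.sql.Encoders;

// return a list of paths to small files
List<String> paths = getAllPaths();
// read up to 100000 small files at once into memory;
// parquet(String...) takes varargs, so the list is converted to an array
sparkSession
    .read()
    .parquet(paths.toArray(new String[0]))
    .as(Encoders.kryo(SmallFileWrapper.class))
    .coalesce(numPartitions);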

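Problem

The number of small files is not a problem from the perspective of memory consumption. The problem is the speed of reading that many files. It takes 38 seconds to read 490 small files and 266 seconds to read 3420 files. I suppose reading 100,000 files would take a lot of time.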
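As a rough check on how those timings scale, here is a back-of-the-envelope extrapolation. It assumes read time stays proportional to the number of files, which the two measurements suggest but do not prove. The class and method names are made up for illustration.

```java
final class ReadTimeEstimate {
    /** Seconds per file implied by a single measurement. */
    static double secondsPerFile(int files, double seconds) {
        return seconds / files;
    }

    /** Linear extrapolation; assumes cost stays proportional to file count. */
    static double estimateSeconds(int targetFiles, double perFile) {
        return targetFiles * perFile;
    }

    public static void main(String[] args) {
        double small = secondsPerFile(490, 38.0);    // about 0.0776 s per file
        double large = secondsPerFile(3420, 266.0);  // about 0.0778 s per file
        double perFile = (small + large) / 2.0;      // average of both measurements
        double total = estimateSeconds(100_000, perFile);
        System.out.printf("per file: %.4f s, 100000 files: %.0f s (%.1f h)%n",
                perFile, total, total / 3600.0);
    }
}
```

Both measurements give nearly the same cost per file (roughly 0.078 s), so a fixed per-file overhead seems to dominate. At that rate 100,000 files would take a bit over two hours.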

Questions

Will HAR or sequence files speed up the Apache Spark batch read of 10k-100k small files? Why?

Will HAR or sequence files slow down persisting these small files? Why?

P.S.

Batch read is the only operation required for these small files; I don't need to read them by id or in any other way.

Answer 1

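From that post: How does the number of partitions affect `wholeTextFiles` and `textFiles`?

wholeTextFiles uses WholeTextFileInputFormat ... Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition ... Each record in the RDD ... has the entire contents of the file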

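Confirmation in the Spark 1.6.3 Java API documentation for SparkContext
http://spark.apache.org/docs/1.6.3/api/java/index.html

RDD<scala.Tuple2<String,String>> wholeTextFiles(String path, int minPartitions)
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.

Confirmation in the source code (branch 1.6) comments for class WholeTextFileInputFormat
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala

A org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat for reading whole text files. Each file is read as key-value pair, where the key is the file path and the value is the entire content of file.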
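To see why combining helps, here is a deliberately simplified model of packing many small files into a few splits. It is not the actual CombineFileInputFormat algorithm, which also takes node and rack locality into account. The names `pack` and `maxSplitBytes` are hypothetical, and the 128 MB cap is chosen only to match the block size in the question; it is not necessarily the value Spark uses.

```java
import java.util.ArrayList;
import java.util.List;

final class CombineSplitSketch {
    /** Greedily packs file sizes, in order, into splits of at most maxSplitBytes. */
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it over the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            sizes.add(30L * 1024); // assume 30 KB per file, mid-range of the question
        }
        long maxSplitBytes = 128L * 1024 * 1024; // assumed cap, see lead-in
        System.out.println(pack(sizes, maxSplitBytes).size() + " splits for "
                + sizes.size() + " files");
    }
}
```

Under these assumptions, 100,000 files of 30 KB each end up in 23 splits instead of 100,000 separate ones.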

For the record, Hadoop CombineFileInputFormat is the standard way to stuff multiple small files into a single Mapper; it can be used in Hive with the properties hive.hadoop.supports.splittable.combineinputformat and hive.input.format.

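Spark wholeTextFiles() reuses that Hadoop feature, with two drawbacks:
(a) you have to consume a whole directory and cannot filter out files by name before loading them (you can only filter after loading)
(b) you have to post-process the RDD by splitting each file into multiple records, if required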
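The two post-processing steps in (a) and (b) can be sketched in plain Java on (path, content) pairs, which is the record shape wholeTextFiles() produces. This is only an illustration: the `.csv` suffix, the line-based record format, and all class and method names are assumptions. In a real job, the same predicate and splitting function would be passed to `filter` and `flatMap` on the resulting pair RDD.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class WholeFilePostProcessSketch {
    /** Builds one (path, content) pair, mimicking a wholeTextFiles() record. */
    static Map.Entry<String, String> entry(String path, String content) {
        return new AbstractMap.SimpleEntry<String, String>(path, content);
    }

    /** (a) Name filter applied after loading; the ".csv" suffix is hypothetical. */
    static boolean keep(String path) {
        return path.endsWith(".csv");
    }

    /** (b) Splits a file's whole content into line records, dropping empty lines. */
    static List<String> toRecords(String content) {
        return Arrays.stream(content.split("\\R"))
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
    }

    /** Applies the name filter, then flattens the kept files into records. */
    static List<String> process(List<Map.Entry<String, String>> files) {
        return files.stream()
                .filter(f -> keep(f.getKey()))
                .flatMap(f -> toRecords(f.getValue()).stream())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> files = new ArrayList<>();
        files.add(entry("hdfs:///data/a.csv", "1,x\n2,y\n"));
        files.add(entry("hdfs:///data/b.tmp", "ignored\n"));
        files.add(entry("hdfs:///data/c.csv", "3,z"));
        System.out.println(process(files));
    }
}
```

Note that the filter runs only after every file has already been read, which is exactly the cost that drawback (a) describes.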

That seems to be a viable solution nonetheless, cf. that post: Spark partitioning/cluster enforcing

Alternatively, you can build your own custom file reader based on that same Hadoop CombineFileInputFormat, cf. that post: Apache Spark on YARN: Large number of input data files (combine multiple input files in spark)

Apache Spark on HDFS: read 10k-100k of small files at once
