Question

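When using sqlContext.load on multiple text files, how do I keep Spark from splitting each file across several partitions? This is not a problem with gzip files, and I would like regular text files to behave the same way.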

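sc.wholeTextFiles works, except that reading a single 100 MB file somehow requires 3 GB of memory, so I would rather use some form of streaming, since we sometimes need to read even larger files.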

Answer 1

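Splittability is a property of the InputFormat. TextInputFormat is conditionally splittable, depending on the source: plain text and some compressed formats can be split, but gzip is fundamentally not splittable.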
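As a rough illustration of that condition, the sketch below mirrors the kind of check TextInputFormat performs: look up a compression codec from the file name, and treat the file as splittable when there is no codec or the codec is a splittable one. The object name SplitCheck and the helper wouldSplit are hypothetical names introduced here, and the sketch assumes the Hadoop client libraries are on the classpath with the default codec set.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Hypothetical helper, not part of Hadoop or Spark.
object SplitCheck {
  // The codec is resolved from the file name suffix only; the file is never opened.
  def wouldSplit(conf: Configuration, file: Path): Boolean = {
    val codec = new CompressionCodecFactory(conf).getCodec(file)
    // No codec means plain text; otherwise only splittable codecs (e.g. bzip2) qualify.
    codec == null || codec.isInstanceOf[SplittableCompressionCodec]
  }

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    for (name <- Seq("data.txt", "data.gz", "data.bz2"))
      println(s"$name splittable: ${wouldSplit(conf, new Path(name))}")
  }
}
```

With the default codecs, the plain-text and bzip2 names should be reported as splittable and the gzip name as not, which matches the behavior described in the question.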

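To get the behavior you want, extend TextInputFormat as your own NonSplittingTextInputFormat and override the isSplitable method so that it always returns false. You can then load the files with code similar to the way sc.textFile is implemented: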

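import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Old-API (mapred) TextInputFormat that never splits a file, so each file ends up in one partition.
class NonSplittingTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

// path and minPartitions come from the caller; keep only the line text of each (offset, line) pair.
sc.hadoopFile(path, classOf[NonSplittingTextInputFormat], classOf[LongWritable], classOf[Text],
  minPartitions).map(pair => pair._2.toString)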
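One way to check that the approach does what is intended is to compare the number of partitions with the number of input files. The following is a minimal sketch under a few assumptions: Spark runs in local mode, the two small temp files and the names NonSplittingCheck and partitionsFor are made up for illustration, and minPartitions is deliberately set higher than the file count so that a splittable format would normally cut the files up.

```scala
import java.nio.file.Files
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Same idea as in the answer: report every file as non-splittable.
class NonSplittingTextInputFormat extends TextInputFormat {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

// Hypothetical helper object for the check.
object NonSplittingCheck {
  // Number of partitions Spark plans for the given path with the non-splitting format.
  def partitionsFor(sc: SparkContext, path: String, minPartitions: Int): Int =
    sc.hadoopFile(path, classOf[NonSplittingTextInputFormat],
      classOf[LongWritable], classOf[Text], minPartitions).getNumPartitions

  def main(args: Array[String]): Unit = {
    // Two small illustrative files in a temp directory.
    val dir = Files.createTempDirectory("nonsplit-demo")
    for (name <- Seq("a.txt", "b.txt"))
      Files.write(dir.resolve(name), ("line\n" * 1000).getBytes("UTF-8"))

    val sc = new SparkContext(
      new SparkConf().setAppName("nonsplit-demo").setMaster("local[2]"))
    try {
      // Ask for more partitions than files; each file should still stay whole.
      println(s"partitions: ${partitionsFor(sc, dir.toString, 8)}")
    } finally sc.stop()
  }
}
```

If the partition count equals the number of non-empty files, each file is being read as a single split, line by line, without loading the whole file into memory the way sc.wholeTextFiles does.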
