Question

The "old" SparkContext.hadoopFile takes a minPartitions argument, which is a hint for the number of partitions:

def hadoopFile[K, V](
  path: String,
  inputFormatClass: Class[_ <: InputFormat[K, V]],
  keyClass: Class[K],
  valueClass: Class[V],
  minPartitions: Int = defaultMinPartitions
  ): RDD[(K, V)]

But there is no such argument on SparkContext.newAPIHadoopFile:

def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
  path: String,
  fClass: Class[F],
  kClass: Class[K],
  vClass: Class[V],
  conf: Configuration = hadoopConfiguration): RDD[(K, V)]

In fact, mapred.InputFormat.getSplits takes a hint argument, but mapreduce.InputFormat.getSplits takes a JobContext. What is the way to influence the number of splits through the new API?

I have tried setting mapreduce.input.fileinputformat.split.maxsize and fs.s3n.block.size on the Configuration object, but they had no effect. I am trying to load a 4.5 GB file from s3n, and it gets loaded in a single task.

https://issues.apache.org/jira/browse/HADOOP-5861 is relevant, but it suggests that I should already see more than one split, since the default block size is 64 MB.

Answer 1

The function newAPIHadoopFile allows you to pass a configuration object, so you can set mapred.max.split.size on it.

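Even though this property is in the mapred namespace, there does not seem to be a newer equivalent option, so I would imagine the new API respects it.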
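A minimal sketch of that suggestion, assuming a plain-text input read with the new-API TextInputFormat. The object name SplitSizeSketch and the helpers withMaxSplitSize and load are hypothetical names introduced here for illustration. The sketch sets both the mapred.* name from this answer and the mapreduce.* name from the question, since which one is honoured may depend on the Hadoop version, and neither is guaranteed to change the split count for every input format or file type.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SplitSizeSketch {
  // Copy the base configuration so the SparkContext's shared
  // hadoopConfiguration is not mutated, then cap the split size
  // under both the older and the newer property name.
  def withMaxSplitSize(base: Configuration, maxSplitBytes: Long): Configuration = {
    val conf = new Configuration(base)
    conf.setLong("mapred.max.split.size", maxSplitBytes)
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", maxSplitBytes)
    conf
  }

  // Load a text file through the new Hadoop API with the capped split size.
  // Assumption: the input is line-oriented text; swap in another
  // InputFormat and key/value classes for other data.
  def load(sc: SparkContext, path: String, maxSplitBytes: Long): RDD[(LongWritable, Text)] =
    sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
      path,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      withMaxSplitSize(sc.hadoopConfiguration, maxSplitBytes))
}
```

For example, SplitSizeSketch.load(sc, path, 64L * 1024 * 1024) asks for splits of at most 64 MB. Checking the partition count of the returned RDD shows whether the setting took effect for a given file.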
