Question

My HDFS file paths contain metadata that I want to access in Spark, for example:

sc.newAPIHadoopFile("hdfs://.../*", ...)
  .map( rdd => /* access hdfs path here */ )

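In Hadoop, I can get the path of the whole split with FileSplit.getPath(). Is there anything similar I can do in Spark, or do I have to extend NewHadoopRDD and attach the path String to every RDD element, which I assume would be quite expensive?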

Answer 1

The closure you pass to map() has no access to metadata or execution-context information.

What you probably want is

mapPartitionsWithContext

Similar to mapPartitions, but allows accessing information about the processing state within the mapper

Then you can do something like:

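import org.apache.spark.TaskContext

// Spark 1.x-era API: keeps the even numbers of a partition and logs
// task details once the task completes.
def myfunc(tc: TaskContext, iter: Iterator[Int]) : Iterator[Int] = {
  tc.addOnCompleteCallback(() => println(
    "Partition: "     + tc.partitionId +
    ", AttemptID: "   + tc.attemptId   +
    ", Interrupted: " + tc.interrupted))

  iter.toList.filter(_ % 2 == 0).iterator
}

// Assumption: `a` is an existing RDD[Int], e.g. sc.parallelize(1 to 9, 3).
a.mapPartitionsWithContext(myfunc).collect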
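
mapPartitionsWithContext was deprecated in later Spark 1.x releases and is no longer available in Spark 2.x, so on a newer cluster the same idea would probably be written with mapPartitions and TaskContext.get(). The following is only a sketch under that assumption; the helper name evensWithPartition is made up for illustration, and like the example above it exposes task information, not the HDFS file name.

```scala
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Hypothetical helper: keeps the even numbers and tags each one with the
// ID of the partition that produced it.
def evensWithPartition(rdd: RDD[Int]): RDD[(Int, Int)] =
  rdd.mapPartitions { iter =>
    // TaskContext.get() returns the context of the task running this partition.
    val tc = TaskContext.get()
    val partition = tc.partitionId()
    iter.filter(_ % 2 == 0).map(x => (partition, x))
  }
```

It could be called as evensWithPartition(sc.parallelize(1 to 9, 3)).collect(), assuming sc is an existing SparkContext.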

UPDATE: The previous solution does not provide the HDFS file name. You may need to do the following:

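Create a custom InputFormat that extends FileInputFormat
Create a custom RecordReader that, for each line, outputs the file associated with the InputSplit followed by the actual value of the line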
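
The two steps above might look roughly like the sketch below. It is not taken from the original answer: the class names PathPrefixedLineReader and PathPrefixedTextInputFormat and the tab separator are assumptions chosen for illustration. It extends TextInputFormat (itself a FileInputFormat) so that split handling stays unchanged, and wraps the stock LineRecordReader rather than reimplementing line parsing.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader, TextInputFormat}

// Hypothetical reader: delegates to LineRecordReader and prefixes every
// line with the path of the file the split belongs to.
class PathPrefixedLineReader extends RecordReader[LongWritable, Text] {
  private val lines = new LineRecordReader()
  private val out = new Text()
  private var path = ""

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    // Splits produced by FileInputFormat subclasses are FileSplit instances.
    path = split.asInstanceOf[FileSplit].getPath.toString
    lines.initialize(split, context)
  }

  override def nextKeyValue(): Boolean = {
    val hasNext = lines.nextKeyValue()
    // Assumption: a tab separates the path from the original line.
    if (hasNext) out.set(path + "\t" + lines.getCurrentValue.toString)
    hasNext
  }

  override def getCurrentKey(): LongWritable = lines.getCurrentKey
  override def getCurrentValue(): Text = out
  override def getProgress(): Float = lines.getProgress
  override def close(): Unit = lines.close()
}

// Hypothetical input format that plugs the reader above into TextInputFormat.
class PathPrefixedTextInputFormat extends TextInputFormat {
  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    new PathPrefixedLineReader
}
```

From Spark it could then be read with sc.newAPIHadoopFile[LongWritable, Text, PathPrefixedTextInputFormat]("hdfs://.../*").map(_._2.toString); converting to String right away matters because Hadoop reuses the Text object between records. Note that this does build one extra string per line, which is the cost the question was worried about.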

In your Spark mapper you would then parse out the first field, which now holds the HDFS file name; the rest of the mapper stays the same.
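
As a rough illustration of that parsing step, the helper below splits a record into the file name and the original line. The name splitPathField and the tab separator are assumptions; the actual separator is whatever the custom RecordReader writes between the path and the line.

```scala
// Hypothetical helper. Assumes each record has the form
// "<hdfs path>\t<original line>", as produced by a custom RecordReader.
def splitPathField(record: String): (String, String) = {
  // Split on the first tab only, so tabs inside the line itself survive.
  val i = record.indexOf('\t')
  require(i >= 0, s"record has no path field: $record")
  (record.substring(0, i), record.substring(i + 1))
}
```

In the mapper this would be used as something like rdd.map(splitPathField), after which the existing per-line logic runs on the second element of each pair.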
