Access files that start with an underscore in Apache Spark

Date: 2016-07-11 22:02:41

Tags: hadoop apache-spark

I am trying to access .gz files on S3 that start with _ in Apache Spark. Unfortunately, Spark deems these files invisible and returns Input path does not exist: s3n:.../_1013.gz. If I remove the underscore, it finds the file just fine.

I tried adding a custom PathFilter to the hadoopConfig:

package CustomReader

import org.apache.hadoop.fs.{Path, PathFilter}

class GFilterZip extends PathFilter {
  // accept every path, including names that start with _ or .
  override def accept(path: Path): Boolean = true
}
// in spark settings
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[CustomReader.GFilterZip], classOf[org.apache.hadoop.fs.PathFilter])

but I still have the same problem. Any ideas?

System: Apache Spark 1.6.0 with Hadoop 2.3

1 Answer:

Answer 0 (score: 2):

Files whose names start with _ or . are hidden files.

The hiddenFileFilter is always applied; it is added in the method org.apache.hadoop.mapred.FileInputFormat.listStatus, where it is combined (via logical AND) with any filter you configure, so a custom PathFilter cannot override it.
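For reference, the built-in filter behaves like the following Scala paraphrase of the (Java, private, static) hiddenFileFilter field in Hadoop 2.x's FileInputFormat:

import org.apache.hadoop.fs.{Path, PathFilter}

// Scala paraphrase of Hadoop's built-in hiddenFileFilter: it rejects
// any path whose name starts with "_" or ".". listStatus ANDs this
// with the user-supplied filter, so returning true from a custom
// PathFilter cannot re-admit files this filter has already rejected.
val hiddenFileFilter = new PathFilter {
  override def accept(p: Path): Boolean =
    !p.getName.startsWith("_") && !p.getName.startsWith(".")
}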

See this answer: which files ignored as input by mapper?
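As a possible workaround (a minimal sketch, not a tested drop-in solution): since the hidden-file filter is hard-coded inside listStatus, a custom PathFilter cannot help, but you can subclass TextInputFormat and override listStatus itself. The class name NoHiddenFilesTextInputFormat and the bucket path below are made up for illustration, and the sketch assumes the input path is an explicit file or a glob that matches files directly, because it skips the directory-expansion step the real listStatus performs:

package CustomReader

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// TextInputFormat variant that lists input paths without applying
// Hadoop's built-in hidden-file filter, so files whose names start
// with _ or . are kept.
class NoHiddenFilesTextInputFormat extends TextInputFormat {
  override protected def listStatus(job: JobConf): Array[FileStatus] = {
    FileInputFormat.getInputPaths(job).flatMap { p =>
      val fs = p.getFileSystem(job)
      // globStatus without a filter keeps _ and . files; it returns
      // null when nothing matches, hence the Option wrapper.
      Option(fs.globStatus(p)).getOrElse(Array.empty[FileStatus])
    }
  }
}

// usage (hypothetical bucket path):
// val lines = sc
//   .hadoopFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
//               CustomReader.NoHiddenFilesTextInputFormat]("s3n://my-bucket/_1013.gz")
//   .map(_._2.toString)

Alternatively, the simplest fix is often to rename or copy the files so their names do not start with an underscore, since that prefix is conventionally reserved for markers such as _SUCCESS.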