I was able to test the following code successfully with Spark on an EMR cluster, but I cannot get a unit test to run in IntelliJ against the local file system. Can anyone show me how to point the code below at the local file system so it works in IntelliJ?
Works on the EMR cluster:
FileSystem.get(new URI("s3n://bucket"), sc.hadoopConfiguration).exists(new Path("/path_to_check"))
Does not work in IntelliJ; it always returns false:
FileSystem.get(new URI("file://somelocal/bucket"), sc.hadoopConfiguration).exists(new Path("/some/local/path_to_check"))
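One likely culprit (an observation about the URI syntax, not something stated in the question) is the URI itself: in `file://somelocal/bucket`, the segment `somelocal` is parsed as the URI *authority* (a host name), not as the first directory of the path. A local path needs three slashes, `file:///...`. A minimal stdlib-only sketch showing how `java.net.URI` splits the two forms:

```scala
import java.net.URI

object UriCheck extends App {
  // Two slashes: "somelocal" becomes the authority (host), not a directory
  val wrong = new URI("file://somelocal/bucket")
  println(wrong.getAuthority) // somelocal
  println(wrong.getPath)      // /bucket

  // Three slashes: empty authority, full local path preserved
  val right = new URI("file:///somelocal/bucket")
  println(right.getAuthority) // null
  println(right.getPath)      // /somelocal/bucket
}
```

So in the failing snippet, Hadoop is effectively asked for `/bucket` on a host called `somelocal`, which never exists, and the check returns false.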
Answer 0 (score: 0)
You can use org.apache.hadoop.fs.FileSystem:
import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

import scala.util.matching.Regex

def isFileExists(path: String, pattern: String)(implicit spark: SparkSession): Boolean = {
  val fixedPath = path.stripSuffix("/") + "/"
  val conf = spark.sparkContext.hadoopConfiguration
  val fs = FileSystem.get(new URI(path), conf)
  val reg = new Regex(pattern)
  try {
    val files = fs.listFiles(new Path(fixedPath), true)
    var flag = false
    // hack: listFiles returns a Hadoop RemoteIterator, which is not a
    // java.util.Iterator, so it has to be drained with a while loop
    while (files.hasNext) {
      reg.findFirstMatchIn(files.next().toString) match {
        case Some(_) => flag = true
        case None    =>
      }
    }
    flag
  } catch {
    // listFiles throws FileNotFoundException if the directory doesn't exist
    case _: java.io.FileNotFoundException => false
    case e: Throwable                     => throw e
  } finally {
    fs.close()
  }
}
It works with s3, hdfs, and the local file system, and you can write unit tests against it.
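The regex filter at the heart of the helper can also be unit-tested in isolation, without Spark or Hadoop on the classpath. A small stdlib-only sketch (the file names and the `.parquet` pattern are hypothetical, chosen just to illustrate the matching):

```scala
import scala.util.matching.Regex

object RegexFilterDemo extends App {
  // Hypothetical path strings, in the shape the RemoteIterator would yield
  val files = Seq(
    "file:///data/part-00000.parquet",
    "file:///data/_SUCCESS"
  )
  val reg = new Regex(".*\\.parquet$")

  // Same logic as the while loop above: true if any listed file matches
  val flag = files.exists(f => reg.findFirstMatchIn(f).isDefined)
  println(flag) // true
}
```

This mirrors the `findFirstMatchIn` check in the loop, so a test suite can cover the pattern logic separately from the file-system access.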