Question

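I was able to test the following code successfully with Spark on an EMR cluster, but I cannot write a unit test for it in IntelliJ against the local file system. Can anyone help me specify the local file system in the code below when running in IntelliJ?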

Works on the EMR cluster:

FileSystem.get(new URI("s3n://bucket"), sc.hadoopConfiguration).exists(new Path("/path_to_check"))

Does not work in IntelliJ; it always returns false:

FileSystem.get(new URI("file://somelocal/bucket"), sc.hadoopConfiguration).exists(new Path("/some/local/path_to_check"))
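One thing worth checking in the local call is how the URI itself is parsed. The sketch below uses only java.net.URI (the object name UriParts and the helper parts are illustrative, not from the question). It shows that in file://somelocal/bucket the segment somelocal is read as the authority (host) rather than as a directory, while the three-slash form has no authority. This may not be the whole reason for the false result, but it is a cheap thing to rule out.

```scala
import java.net.URI

object UriParts {
  // Returns (authority, path) as java.net.URI parses them.
  // getAuthority is null when the URI has no authority, hence the Option.
  def parts(s: String): (Option[String], String) = {
    val u = new URI(s)
    (Option(u.getAuthority), u.getPath)
  }

  def main(args: Array[String]): Unit = {
    // Two slashes: "somelocal" becomes the authority, the path is only "/bucket"
    println(parts("file://somelocal/bucket"))
    // Three slashes: no authority, the whole "/somelocal/bucket" is the path
    println(parts("file:///somelocal/bucket"))
  }
}
```

Note also that the Path passed to exists starts with a slash, so on the local file system it is resolved from the root directory. There is no bucket-like prefix as there is on S3, so the path has to be the full absolute location of the file.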

Answer 1

You can use org.apache.hadoop.fs.FileSystem:

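import java.io.FileNotFoundException
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def isFileExists(path: String, pattern: String)(implicit spark: SparkSession): Boolean = {
  val fixedPath = path.stripSuffix("/") + "/"
  val conf = spark.sparkContext.hadoopConfiguration
  // The URI scheme (s3n://, hdfs://, file://) selects the FileSystem implementation
  val fs = FileSystem.get(new URI(path), conf)
  val reg = pattern.r

  try {
    // Recursive listing of every file under the directory
    val files = fs.listFiles(new Path(fixedPath), true)
    var found = false
    // listFiles returns a Hadoop RemoteIterator, which is not a java.util.Iterator,
    // so iterate by hand and stop at the first match
    while (!found && files.hasNext) {
      // Match against the file path only, not the whole file status string
      found = reg.findFirstIn(files.next().getPath.toString).isDefined
    }
    found
  } catch {
    // The directory does not exist
    case _: FileNotFoundException => false
  }
  // fs is deliberately not closed: FileSystem.get returns a cached instance by default,
  // and closing it here could break other users of the same instance, such as Spark
}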

It works with S3, HDFS, and the local file system, and you can write unit tests for it.
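As a rough illustration of such a unit test, the sketch below creates a temporary directory and checks it through Hadoop's local file system. The object LocalFsExistsCheck and the helper pathExists are hypothetical names, not part of the answer. The sketch uses a plain Hadoop Configuration so that it runs without a Spark session; in a Spark test you would presumably pass sc.hadoopConfiguration instead.

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object LocalFsExistsCheck {
  // Hypothetical helper: lets the Path pick its own FileSystem from the URI scheme,
  // so the same call covers file://, hdfs:// and S3 locations.
  def pathExists(location: String, conf: Configuration): Boolean = {
    val path = new Path(location)
    path.getFileSystem(conf).exists(path)
  }

  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // assumption: stands in for sc.hadoopConfiguration
    val dir = Files.createTempDirectory("fs-exists-test")
    val file = Files.createFile(dir.resolve("part-00000"))

    // toUri yields a file:/// URI, so the local file system is selected
    assert(pathExists(file.toUri.toString, conf))
    assert(!pathExists(dir.resolve("missing").toUri.toString, conf))
  }
}
```

Building the location from a real temporary file avoids hand-written file:// strings, which are easy to get wrong.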

Unit testing whether a file exists on the local file system with Spark
