Spark-Scala unable to infer schema (defer input path validation into DataSource)

Date: 2018-11-11 12:23:52

Tags: java scala apache-spark apache-spark-sql

SPARK-26039

This error occurs when loading an empty ORC folder. Is there any way to bypass it?

val df = spark.read.format("orc").load(orcFolderPath)

org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  ... 49 elided

This error likely occurs because the ORC reader tries to infer the schema from the files in the folder, which fails when the folder is empty. The empty folder can legitimately appear in our storage, so I want to bypass this particular case.

try {
  spark.read.format("orc").load(path)
} catch {
  case ex: org.apache.spark.sql.AnalysisException =>
    null
}

I tried catching the exception this way. Any other approach would be helpful.
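One alternative worth considering: the `AnalysisException` comes from schema *inference*, so if the schema of the ORC data is already known, supplying it explicitly to `DataFrameReader.schema` avoids inference entirely and an empty folder simply yields an empty DataFrame. A minimal sketch, where the field names and types are hypothetical placeholders for the actual schema:

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical schema -- replace with the real columns of the ORC data.
val knownSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema, Spark does not need to read any file
// to infer one, so an empty folder no longer throws.
val df = spark.read.schema(knownSchema).format("orc").load(orcFolderPath)
```

This only applies when the schema is stable and known up front; if it varies per folder, a filesystem check like the answer below is the more general route.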

1 Answer:

Answer 0 (score: 0)

Here is one more solution... it is not the best one either...

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def pathStatus(path: String): Boolean = {
  val config: Configuration = new Configuration()
  val fs: FileSystem = FileSystem.get(config)
  // globStatus returns null when the pattern matches nothing
  fs.globStatus(new Path(path)) != null
}