Question

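I'm writing an internal API for working with custom measuring data formats in Spark. Because the schema differs depending on the type of measuring data, I'm using the DataFrame API. To read the files I use Hadoop's FileInputFormat together with sc.newAPIHadoopFile, since the measuring data formats cannot be reduced to simple text files.

In my API, I want to return an empty DataFrame instead of throwing the "No input paths specified in job" exception, so I first took the naive approach: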

import java.io.IOException
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import spark.implicits._ // needed for .toDF

try {
  spark.sparkContext
    .newAPIHadoopFile(inPath,
                      classOf[OneOfMyCustomMeasuringDataInputFormat],
                      classOf[SomeAppropriateKeyWritable],
                      classOf[SomeAppropriateValueWritable],
                      conf)
    .map { case (k, v) => SomeAppropriateRecordCaseClass(/* data from k and v */) }
    .toDF
} catch {
  // == is null-safe in Scala, unlike calling equals on a possibly null message
  case e: IOException if e.getMessage == "No input paths specified in job" =>
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
                          // Some implicits I made to simplify schema construction:
                          ("foo" of SomeType) ::
                          ("bar" of SomeOtherType) ::
                          // more ::
                          Nil: StructType)
}

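However, because RDDs are lazy, the exception for missing input paths is not thrown until the DataFrame is actually accessed, so the catch block above never sees it.

For now, I handle this in each of my FileInputFormats, and I have told colleagues who may add more formats in the future to check for this exception in the listStatus method and return an empty list. I'm wondering whether this can be done in a more general way.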
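To make the laziness concrete, here is a minimal sketch, assuming the input format fails only while computing its splits. Asking an RDD for its partitions makes Spark compute the input splits on the driver, so the IOException can surface inside a try instead of at the first action. The names EagerInput and orEmptyOnNoInput are hypothetical and not part of Spark.

```scala
import java.io.IOException

import org.apache.spark.rdd.RDD

object EagerInput {
  // Hypothetical helper, not part of Spark. `build` and `empty` are passed
  // by name so that `build` is evaluated inside the try block.
  def orEmptyOnNoInput[T](build: => RDD[T])(empty: => RDD[T]): RDD[T] =
    try {
      val rdd = build
      // Forces partition computation now; for a Hadoop-backed RDD this
      // calls the input format's getSplits on the driver.
      rdd.partitions
      rdd
    } catch {
      case e: IOException if e.getMessage == "No input paths specified in job" =>
        empty
    }
}
```

The trade-off is that the file listing happens eagerly at this call rather than at the first action, which may matter for very large input directories.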

Answer 1

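After digging into the Hadoop and Spark source code, I saw that this behavior is hard-coded, so for now the best solution is to handle it in the FileInputFormat. I added an extra option to my Hadoop Configuration, named FileInputFormat.dontThrowOnEmptyPaths, which my custom input formats honor. They catch the IOException shown in my code example above and rethrow it only if this option is unset or set to false.

This is a workaround; I posted an enhancement suggestion to the JIRA about this.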
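A rough sketch of what such an input format base class might look like, assuming the new (mapreduce) Hadoop API. Only the option name comes from the description above; the class and object names are hypothetical, and the real formats may differ in detail.

```scala
import java.io.IOException
import java.util.{Collections, List => JList}

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

object EmptyPaths {
  // Custom option name from the text above; it is not a built-in Hadoop setting.
  val DontThrowKey = "FileInputFormat.dontThrowOnEmptyPaths"
  val NoInputPathsMessage = "No input paths specified in job"
}

// Hypothetical base class for the custom formats; createRecordReader stays abstract.
abstract class ConfigurableEmptyPathsInputFormat[K, V] extends FileInputFormat[K, V] {
  override protected def listStatus(job: JobContext): JList[FileStatus] =
    try super.listStatus(job)
    catch {
      // Swallow the error only when the option is explicitly true;
      // otherwise the guard fails and the IOException propagates unchanged.
      case e: IOException
          if e.getMessage == EmptyPaths.NoInputPathsMessage &&
            job.getConfiguration.getBoolean(EmptyPaths.DontThrowKey, false) =>
        Collections.emptyList[FileStatus]()
    }
}
```

With the option set on the driver side, for example conf.setBoolean("FileInputFormat.dontThrowOnEmptyPaths", true) before passing conf to newAPIHadoopFile, an empty file list produces no splits, so the resulting RDD should simply have zero partitions.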

How to suppress "No input paths specified in job" and return an empty RDD / DataFrame?
