How to pass a list of paths to spark.read.load?

Date: 2018-06-16 17:53:38

Tags: scala apache-spark apache-spark-sql

I can load multiple files at once by passing multiple paths to the load method, e.g.

spark.read
  .format("com.databricks.spark.avro")
  .load(
    "/data/src/entity1/2018-01-01",
    "/data/src/entity1/2018-01-12",
    "/data/src/entity1/2018-01-14")

I would like to first prepare a list of paths and then pass it to the load method, but I get the following compilation error:

val paths = Seq(
  "/data/src/entity1/2018-01-01",
  "/data/src/entity1/2018-01-12",
  "/data/src/entity1/2018-01-14")
spark.read.format("com.databricks.spark.avro").load(paths)

<console>:29: error: overloaded method value load with alternatives:
  (paths: String*)org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (List[String])
       spark.read.format("com.databricks.spark.avro").load(paths)

Why is that? How can I pass a list of paths to the load method?

4 Answers:

Answer 0 (score: 5)

You just need the splat operator (`: _*`) on the paths list, as in:

spark.read.format("com.databricks.spark.avro").load(paths: _*)
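
For completeness, a minimal end-to-end sketch combining the paths list from the question with the splat syntax (the DataFrame name df is only illustrative):

val paths = Seq(
  "/data/src/entity1/2018-01-01",
  "/data/src/entity1/2018-01-12",
  "/data/src/entity1/2018-01-14")

// Expand the Seq into the varargs expected by load(paths: String*)
val df = spark.read
  .format("com.databricks.spark.avro")
  .load(paths: _*)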

Answer 1 (score: 2)

The load method accepts a varargs parameter, not a list. So you have to explicitly convert the list into varargs by adding `: _*` in the load call: spark.read.format("com.databricks.spark.avro").load(paths: _*)

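To see why the conversion is needed, here is a small self-contained sketch of how Scala varargs interact with `: _*` (the method and values are hypothetical, for illustration only):

// A method declared with a varargs parameter, analogous to load(paths: String*)
def printAll(items: String*): Unit = items.foreach(println)

val values = Seq("a", "b", "c")

// printAll(values)     // does not compile: Seq[String] is not String*
printAll(values: _*)    // compiles: the sequence is expanded into varargs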

Answer 2 (score: 0)

You don't need to create a list at all. You can do something like this:

val df=spark.read.format("com.databricks.spark.avro").option("header","true").load("/data/src/entity1/*")
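
If you only want specific dates rather than everything under the directory, Hadoop-style globs also support brace alternation; a sketch reusing the dates from the question (assuming the directory layout shown there):

// Brace alternation in the glob selects only the listed date directories
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/data/src/entity1/2018-01-{01,12,14}")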

Answer 3 (score: 0)

Alternatively, you can use the paths option, as seen in the Spark source code (ResolvedDataSource.scala):

val paths = {
            if (caseInsensitiveOptions.contains("paths") &&
              caseInsensitiveOptions.contains("path")) {
              throw new AnalysisException(s"Both path and paths options are present.")
            }
            caseInsensitiveOptions.get("paths")
              .map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
              .getOrElse(Array(caseInsensitiveOptions("path")))
              .flatMap{ pathString =>
                val hdfsPath = new Path(pathString)
                val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
                val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
                SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
              }
          }

Then simply:

sqlContext.read.option("paths", paths.mkString(",")).load()

will do the trick.
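
Putting that together, a sketch that builds the comma-separated paths option from the list in the question (this assumes the data source honors the paths option as in the source snippet above; paths containing literal commas would need to be escaped with a backslash, per the split regex):

val paths = Seq(
  "/data/src/entity1/2018-01-01",
  "/data/src/entity1/2018-01-12",
  "/data/src/entity1/2018-01-14")

// Join the list into a single comma-separated option value
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .option("paths", paths.mkString(","))
  .load()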