I can load several files at once by passing multiple paths to the load method, for example:
spark.read
.format("com.databricks.spark.avro")
.load(
"/data/src/entity1/2018-01-01",
"/data/src/entity1/2018-01-12",
"/data/src/entity1/2018-01-14")
I would like to first prepare a list of paths and pass it to the load method, but I get the following compilation error:
val paths = Seq(
"/data/src/entity1/2018-01-01",
"/data/src/entity1/2018-01-12",
"/data/src/entity1/2018-01-14")
spark.read.format("com.databricks.spark.avro").load(paths)
<console>:29: error: overloaded method value load with alternatives:
  (paths: String*)org.apache.spark.sql.DataFrame <and>
  (path: String)org.apache.spark.sql.DataFrame
 cannot be applied to (List[String])
       spark.read.format("com.databricks.spark.avro").load(paths)
Why is that? How can I pass a list of paths to the load method?
Answer 0 (score: 5)
You just need the splat operator (: _*) on the paths list:
spark.read.format("com.databricks.spark.avro").load(paths: _*)
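To see why the splat operator is needed, here is a minimal sketch in plain Scala (no Spark required); the hypothetical load below stands in for Spark's load(paths: String*) overload:

```scala
// A varargs method, like Spark's DataFrameReader.load(paths: String*).
def load(paths: String*): Int = paths.length

val paths = Seq(
  "/data/src/entity1/2018-01-01",
  "/data/src/entity1/2018-01-12",
  "/data/src/entity1/2018-01-14")

// load(paths)          // does NOT compile: a Seq[String] is not String*
val n = load(paths: _*) // compiles: `: _*` expands the Seq into varargs
```

The `: _*` type ascription tells the compiler to pass the sequence's elements as individual variable arguments rather than as a single Seq.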
Answer 1 (score: 2)
The load method accepts a varargs-typed argument, not a List. So you explicitly convert the list into varargs by appending : _* in the call:
spark.read.format("com.databricks.spark.avro").load(paths: _*)
Answer 2 (score: 0)
You don't need to create a list at all. If the paths share a common parent directory, you can use a glob pattern:
val df=spark.read.format("com.databricks.spark.avro").option("header","true").load("/data/src/entity1/*")
Answer 3 (score: 0)
Alternatively, you can use the paths option, as seen in the Spark source code (ResolvedDataSource.scala):
val paths = {
  if (caseInsensitiveOptions.contains("paths") &&
      caseInsensitiveOptions.contains("path")) {
    throw new AnalysisException(s"Both path and paths options are present.")
  }
  caseInsensitiveOptions.get("paths")
    .map(_.split("(?<!\\\\),").map(StringUtils.unEscapeString(_, '\\', ',')))
    .getOrElse(Array(caseInsensitiveOptions("path")))
    .flatMap { pathString =>
      val hdfsPath = new Path(pathString)
      val fs = hdfsPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)
      val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
      SparkHadoopUtil.get.globPathIfNecessary(qualified).map(_.toString)
    }
}
Then simply:
sqlContext.read.option("paths", paths.mkString(",")).load()
will do the trick.
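The source above splits the paths option on commas that are not preceded by a backslash. A small plain-Scala sketch of that round trip (joining with mkString, then splitting with the same lookbehind regex; no Spark needed):

```scala
// Paths the caller wants to load.
val paths = Seq(
  "/data/src/entity1/2018-01-01",
  "/data/src/entity1/2018-01-12",
  "/data/src/entity1/2018-01-14")

// The caller joins them into one comma-separated option value...
val joined = paths.mkString(",")

// ...and Spark splits on commas NOT preceded by a backslash, so a literal
// comma inside a path can be escaped as "\," without breaking the split.
val recovered = joined.split("(?<!\\\\),").toSeq
```

This is only an illustration of the regex's behavior; the real code additionally unescapes each piece with StringUtils.unEscapeString and resolves globs against HDFS.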