Question

I am having trouble reading several DataFrames. I have this function:

def readDF(hdfsPath: String /*, more arguments */): DataFrame = ??? // function goes here

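It takes the HDFS path of a partition and returns a DataFrame (it basically calls spark.read.parquet, but I am required to use it). I am trying to read some of the partitions by using show partitions, like this: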

val dfs = spark.sql("show partitions table")
  .where(col("partition").contains(someFilterCriteria))
  .map(partition => {
    val hdfsPath = s"hdfs/path/to/table/$partition"
    readDF(hdfsPath)
  }).reduce(_.union(_))

But it gives me this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 3.0 failed 4 times, most recent failure: Lost task 12.3 in stage 3.0 (TID 44, csmlcsworki0021.unix.aacc.corp, executor 1): java.lang.NullPointerException

I think this happens because I am calling spark.read.parquet inside a map over a DataFrame, because if I change the code to this:

val dfs = spark.sql("show partitions table")
  .where(col("partition").contains(someFilterCriteria))
  .map(row => row.getString(0))
  .collect
  .toSeq
  .map(partition => {
    val hdfsPath = s"hdfs/path/to/table/$partition"
    readDF(hdfsPath)
  }).reduce(_.union(_))

it loads the data correctly. However, I would prefer not to use collect if possible. How can I achieve this?

Answer 1

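readDF creates a DataFrame from Parquet files in HDFS, so it must be executed on the driver. In your first version, readDF is called inside a map over the rows of the original DataFrame, which means you are trying to create DataFrames on the executors. That is not possible: the SparkSession is only usable on the driver, which is the likely cause of the NullPointerException.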
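
As a minimal sketch of the driver-side approach described above: collect only the partition names (a small list of strings, not the table data), then call readDF for each one on the driver and union the results. The object name ReadPartitions, the helper readMatching, its parameters, and the two-argument readDF below are hypothetical stand-ins for illustration; the asker's real readDF takes other arguments.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ReadPartitions {
  // Hypothetical stand-in for the asker's readDF (assumption: it wraps spark.read.parquet).
  def readDF(spark: SparkSession, hdfsPath: String): DataFrame =
    spark.read.parquet(hdfsPath)

  // Returns None when no partition matches the filter.
  def readMatching(spark: SparkSession,
                   table: String,
                   basePath: String,
                   filter: String): Option[DataFrame] = {
    import spark.implicits._ // provides the Encoder[String] needed by .as[String]

    // Only the partition names are brought to the driver here.
    val partitions: Seq[String] = spark.sql(s"show partitions $table")
      .where(col("partition").contains(filter))
      .as[String]
      .collect()
      .toSeq

    // This map runs over a local Seq on the driver, so creating DataFrames is allowed.
    partitions
      .map(p => readDF(spark, s"$basePath/$p"))
      .reduceOption(_ union _)
  }
}
```

The collect in this sketch moves only metadata, so it is usually cheap even for tables with many partitions; the Parquet data itself is still read lazily and in a distributed way when an action runs on the resulting DataFrame.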
