Question

I am trying to union two Spark DataFrames that have different sets of columns. For this, I referred to the following link:

How to perform union on two DataFrames with different amounts of columns in spark?

My code is as follows:

val cols1 = finalDF.columns.toSet
val cols2 = df.columns.toSet
val total = cols1 ++ cols2 
finalDF=finalDF.select(expr(cols1, total):_*).unionAll(df.select(expr(cols2, total):_*))

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

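The problem I'm facing is that some of the columns in the two DataFrames are nested: I have columns of StructType as well as primitive types. Now, say column A (a StructType) is in df but not in finalDF. In expr, however,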

case _ => lit(null).as(x)

does not make it a StructType; the null literal ends up as NullType. That is why I cannot union them. It gives me the following error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. NullType <> StructType(StructField(_VALUE,StringType,true), StructField(_id,LongType,true)) at the first column of the second table.

Any suggestions on what I can do here?
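For context, the error arises because lit(null) on its own has NullType, which the union cannot line up with a struct column. A minimal sketch of one workaround, assuming the type of each missing column can be read from the other DataFrame's schema, is to cast the null literal to that type. The helper names aligned and unionWithMissing and the sample data are hypothetical, and the sketch only covers columns missing at the top level, not fields missing inside a struct.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.DataType

object TypedNullUnion {
  // Hypothetical helper: select allCols in a fixed order; a column missing
  // from df becomes a null literal cast to the type recorded in `types`.
  def aligned(df: DataFrame, allCols: Seq[String], types: Map[String, DataType]): DataFrame = {
    val present = df.columns.toSet
    val exprs: Seq[Column] = allCols.map { name =>
      if (present.contains(name)) col(name)
      else lit(null).cast(types(name)).as(name) // typed null instead of NullType
    }
    df.select(exprs: _*)
  }

  // Hypothetical helper: union two DataFrames whose top-level columns differ.
  def unionWithMissing(left: DataFrame, right: DataFrame): DataFrame = {
    // Column name -> type, taken from whichever DataFrame declares the column.
    val types = (left.schema.fields ++ right.schema.fields).map(f => f.name -> f.dataType).toMap
    val allCols = (left.columns ++ right.columns).distinct.toSeq
    aligned(left, allCols, types).union(aligned(right, allCols, types))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("typed-null-union").getOrCreate()
    import spark.implicits._

    // Sample data (assumed): finalDF has no column A; df has A as a struct.
    val finalDF = Seq((1, "x")).toDF("k", "B")
    val df = Seq((2, "v", 7L)).toDF("k", "value", "id")
      .select($"k", struct($"value".as("_VALUE"), $"id".as("_id")).as("A"))

    val result = unionWithMissing(finalDF, df)
    result.printSchema() // A keeps its struct type in the result
    result.show(false)
    spark.stop()
  }
}
```

If a column exists in both DataFrames with different types, the union still fails, so this only helps with columns that are absent on one side. On Spark 3.1 or later, unionByName(other, allowMissingColumns = true) covers the same top-level case without a helper.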

Answer 1

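I'd use the built-in schema inference for this. It is far more expensive, but much simpler than matching complex structures by hand, where conflicts are possible: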

spark.read.json(df1.toJSON.union(df2.toJSON))
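To make the one-liner above concrete, here is a small self-contained sketch. It assumes Spark 2.2 or later (where spark.read.json accepts a Dataset[String]) and a local session; the helper name jsonUnion and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.struct

object JsonRoundTripUnion {
  // Hypothetical helper: serialize both DataFrames to JSON strings,
  // concatenate them, and let Spark infer a single merged schema.
  def jsonUnion(spark: SparkSession, a: DataFrame, b: DataFrame): DataFrame =
    spark.read.json(a.toJSON.union(b.toJSON))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("json-union").getOrCreate()
    import spark.implicits._

    // Sample data (assumed): only df2 has the struct column A.
    val df1 = Seq((1, "x")).toDF("k", "B")
    val df2 = Seq((2, "v", 7L)).toDF("k", "value", "id")
      .select($"k", struct($"value".as("_VALUE"), $"id".as("_id")).as("A"))

    val merged = jsonUnion(spark, df1, df2)
    merged.printSchema() // schema is inferred from the JSON, not copied from df1/df2
    merged.show(false)
    spark.stop()
  }
}
```

Because the schema is re-inferred from JSON text, the resulting types can differ from the originals (integers typically come back as long, for example), so it is worth checking printSchema before relying on the result.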

You could also load all the files at once and join them with the information extracted from the header, using input_file_name.

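import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

val metadata: DataFrame = ???  // Placeholder: just metadata from the header
val data: DataFrame = ???      // Placeholder: all files loaded together

// Tag each row with the file it came from, then join on that path.
metadata.withColumn("file", input_file_name())
  .join(data.withColumn("file", input_file_name()), Seq("file"))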

Answer 2

df = df1.join(df2, ['each', 'shared', 'column'], how='full')

This will fill in the missing data with nulls.
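For comparison with the Scala code elsewhere in the thread, the same full outer join can be sketched in Scala. The helper name fullJoinOnId, the column names, and the sample data are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FullOuterJoinDemo {
  // Hypothetical helper: full outer join on the shared column "id".
  // Columns that exist on only one side are null for rows from the other side.
  def fullJoinOnId(a: DataFrame, b: DataFrame): DataFrame =
    a.join(b, Seq("id"), "full")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("full-join").getOrCreate()
    import spark.implicits._

    // Sample data (assumed): id 2 appears on both sides.
    val a = Seq((1, "x"), (2, "y")).toDF("id", "left_only")
    val b = Seq((2, true), (3, false)).toDF("id", "right_only")

    val joined = fullJoinOnId(a, b)
    joined.show() // 3 rows (ids 1, 2, 3); the two id = 2 rows are merged into one
    spark.stop()
  }
}
```

Note the difference from a union: rows whose shared-column values match are combined into a single row, so the result here has three rows where a union of the same inputs would have four.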
