Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

Date: 2016-07-29 04:34:18

Tags: apache-spark

I have two DataFrames, df1 and df2. Both have the following schema:

 |-- ts: long (nullable = true)
 |-- id: integer (nullable = true)
 |-- managers: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- projects: array (nullable = true)
 |    |-- element: string (containsNull = true)

df1 is created from an Avro file, and df2 from an equivalent Parquet file. However, if I execute df1.unionAll(df2).show(), I get the following error:

    org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)

3 Answers:

Answer 0 (score: 21)

I ran into the same situation, and it turns out that not only do the fields need to be the same, but you also need to keep exactly the same order of fields in both DataFrames for this to work.

Answer 1 (score: 3)

This is old and there are already some answers, but I hit this problem while trying to take the union of two DataFrames, as in...

// Union 2 DataFrames
val df = left.unionAll(right)

As others have said, order matters. So just select the right-hand DataFrame's columns in the same order as the left-hand DataFrame's columns:

// Union 2 DataFrames, but take the columns in the same order
import org.apache.spark.sql.functions.col
val df = left.unionAll(right.select(left.columns.map(col): _*))

Answer 2 (score: 2)

I found the following PR on GitHub:

https://github.com/apache/spark/pull/11333

It is related to UDF (user-defined function) columns, which were not handled correctly during a union and therefore caused the union to fail. The PR fixes it, but the fix has not made it into Spark 1.6.2, and I have not checked Spark 2.x.

If you are still stuck on 1.6.x, a silly workaround is to map the DataFrame to an RDD and back to a DataFrame:

// for a DF with 2 columns (Long, Array[Long]);
// toDF needs `import sqlContext.implicits._` in scope
val simple = dfWithUDFColumn
  .map { r => (r.getLong(0), r.getAs[Seq[Long]](1)) } // DF --> RDD[(Long, Seq[Long])]
  .toDF("id", "tags") // RDD --> back to a DF, now without the UDF column

// dfOrigin has the same structure but no UDF columns
val joined = dfOrigin.unionAll(simple).dropDuplicates(Seq("id")).cache()
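For readers unfamiliar with `dropDuplicates`, the call above keeps one row per `id` after the union. The semantics can be roughly modeled in plain Scala with a hypothetical `dedup` helper (note that Spark makes no guarantee about which duplicate survives; this sketch keeps the first one seen):

```scala
object DedupById {
  // Rough model of df.dropDuplicates(Seq("id")): keep one row per id.
  // LinkedHashMap preserves insertion order, so the first row seen wins here.
  def dedup[A](rows: Seq[(Long, A)]): Seq[(Long, A)] = {
    val seen = scala.collection.mutable.LinkedHashMap[Long, A]()
    rows.foreach { case (id, v) => if (!seen.contains(id)) seen += (id -> v) }
    seen.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Rows from dfOrigin and simple may share ids after the union.
    println(dedup(Seq((1L, "a"), (2L, "b"), (1L, "c"))))
  }
}
```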