Spark resolved attribute(s) missing after join

Date: 2018-04-16 10:17:44

Tags: apache-spark apache-spark-sql

I am seeing an org.apache.spark.sql.AnalysisException when joining DataFrames/Datasets that are themselves the results of an earlier join:

    case class Left(a: Int, b: String, c: Boolean)

    case class Right(a: Int, b: String, c: Int)

    // Boilerplate

    import spark.implicits._

    val left = Seq(
      Left(1, "1", true),
      Left(2, "2", true),
      Left(3, "3", true)
    ).toDF

    val right = Seq(
      Right(1, "1", 1),
      Right(2, "2", 2),
      Right(3, "3", 3)
    ).toDF

    // First join: take a and c from left, b from right
    val joined1 = left
      .join(right, left("a") === right("a"), "inner")
      .select(left("a"), right("b"), left("c"))

    // Second join: reuses the original `left`; this is the join that fails
    val joined2 = joined1
      .join(left, joined1("b") === left("b"))

The actual error message is:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) b#4 missing from b#49,b#15,c#5,a#48,c#50,a#3 in operator !Join Inner, (b#15 = b#4);;
!Join Inner, (b#15 = b#4)
:- Project [a#3, b#15, c#5]
:  +- Join Inner, (a#3 = a#14)
:     :- LocalRelation [a#3, b#4, c#5]
:     +- LocalRelation [a#14, b#15, c#16]
+- LocalRelation [a#48, b#49, c#50]

So I guess the problem is that the join condition in joined2 is expressed in terms of the parent tables' attribute references rather than joined1's: left("b") still points at b#4, which no longer exists on either side, since joined1 projected it out and the analyzer re-aliased the second copy of left to b#49.
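
The re-aliasing can be confirmed by inspecting the analyzed plans (a small sketch; attribute IDs such as b#4 vary from run to run):

    // `left` still exposes its original b attribute, while joined1 exposes
    // the b taken from `right`; joined2 fails analysis eagerly, so only
    // its inputs can be inspected this way.
    println(left.queryExecution.analyzed)
    println(joined1.queryExecution.analyzed)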

I understand that aliasing the tables and referring to the columns as functions.column("leftAlias.column") would get around this. However, I would like to wind the result back into a Dataset, so I would probably have to rename the columns before doing that.
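
For reference, a minimal sketch of that aliasing workaround (the alias names "l", "r", "j1" and "l2" are illustrative, not from the original code):

    import org.apache.spark.sql.functions.col

    // Alias each side and resolve columns by name, so the join condition
    // never captures a stale attribute reference.
    val joined1Aliased = left.as("l")
      .join(right.as("r"), col("l.a") === col("r.a"), "inner")
      .select(col("l.a"), col("r.b"), col("l.c"))

    val joined2Aliased = joined1Aliased.as("j1")
      .join(left.as("l2"), col("j1.b") === col("l2.b"))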

Can anyone suggest a more elegant solution/workaround?
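
(For completeness, one candidate is the usingColumns overload of join, which matches columns by name and so never holds a stale attribute reference, at the cost of collapsing the two b columns into one; a minimal sketch of this general Spark technique, not part of the original code:)

    // Inner equi-join on the shared column name "b"; the result keeps a
    // single "b" column instead of one per side.
    val joined2ByName = joined1.join(left, Seq("b"))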

Many thanks

Terry

0 answers:

No answers yet.