I'm seeing a spark/sql.AnalysisException when joining DataFrames/Datasets that are themselves the result of a previous join:
case class Left(a: Int, b: String, c: Boolean)
case class Right(a: Int, b: String, c: Int)
// Boilerplate
import spark.implicits._
val left = Seq(
  Left(1, "1", true),
  Left(2, "2", true),
  Left(3, "3", true)
).toDF
val right = Seq(
  Right(1, "1", 1),
  Right(2, "2", 2),
  Right(3, "3", 3)
).toDF
val joined1 = left
  .join(right, left("a") === right("a"), "inner")
  .select(left("a"), right("b"), left("c"))
val joined2 = joined1
  .join(left, joined1("b") === left("b"))
The actual error message is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) b#4 missing from b#49,b#15,c#5,a#48,c#50,a#3 in operator !Join Inner, (b#15 = b#4);;
!Join Inner, (b#15 = b#4)
:- Project [a#3, b#15, c#5]
: +- Join Inner, (a#3 = a#14)
: :- LocalRelation [a#3, b#4, c#5]
: +- LocalRelation [a#14, b#15, c#16]
+- LocalRelation [a#48, b#49, c#50]
So I assume the problem is that the column references in joined2 are still expressed in terms of the parent tables rather than joined1: joined1("b") resolves to b#4, which no longer exists in the join's output.
I understand that aliasing the tables and referring to the columns as functions.column("leftAlias.column") works around this. But I want to map the result back into a Dataset, so I would probably have to rename the columns again before doing so.
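For reference, here is a minimal sketch of the aliasing workaround I mean, applied to the snippet above (the alias names "j1" and "l" are arbitrary choices of mine, not anything required by Spark):

```scala
import org.apache.spark.sql.functions.col

// Alias both sides so the join condition refers to the aliases
// instead of the original parent DataFrames.
val joined2 = joined1.as("j1")
  .join(left.as("l"), col("j1.b") === col("l.b"))
  // Keep only one copy of each column; otherwise mapping back to a
  // Dataset fails on the ambiguous/duplicated names.
  .select(col("j1.a"), col("j1.b"), col("j1.c"))
```

This runs without the AnalysisException, but it is exactly the renaming dance I was hoping to avoid.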
Can anyone suggest a more elegant solution or workaround?
Many thanks,
Terry