I am trying to join the following two DataFrames:
val df1 = Seq(
("Verizon", "USA"),
("AT & T", "PK"),
("Verizon", "IND")
).toDF("Brand", "Country")
val df2 = Seq(
(8, "USA"),
(64, "UK"),
(-27, "DE")
).toDF("TS", "Country")
If I join them like this, it works:
df1.join(df2, Seq("Country")).count
But when I use withColumn() and lit() before the join (to replace the column's values), it throws an exception:
df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count
Exception:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LocalRelation
and
Project
+- Filter (isnotnull(_2#680) && (USA = _2#680))
+- LocalRelation [_1#679, _2#680]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
...
It also works when I use crossJoin:
df1.withColumn("Country", lit("USA")).crossJoin(df2.filter(col("Country") === "USA"))
But I don't understand why it doesn't work with a plain join, and why a cross join is needed to make it work. Any help would be appreciated. Thanks.
Answer 0 (score: 0)
Spark's analyzer detected a cross join condition where you intended an inner join.
Because cross joins are expensive, the default behavior is to throw an exception when the optimized plan contains a cartesian product that the query did not explicitly request with crossJoin (or allow via spark.sql.crossJoin.enabled=true, as the error message suggests).
This happens because you replaced the join column with a literal: after withColumn("Country", lit("USA")), every row of df1 has the same Country value, so the equi-join condition no longer relates the two tables. The optimizer constant-folds it into a filter on df2 (the Filter (USA = _2#680) node in the plan above) followed by a cartesian product, and then rejects that product as "missing or trivial".
The cross join behavior is explained in more detail in the thread mentioned by user10465355.
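To see why the condition becomes trivial, here is a minimal sketch in plain Scala (no Spark, so the sample data from the question is modeled as Seqs of tuples). It hand-computes what the optimized plan effectively does: filter df2 down to the literal value, then pair it with every row of df1.

```scala
// Sample data from the question, as plain collections.
val df1 = Seq(("Verizon", "USA"), ("AT & T", "PK"), ("Verizon", "IND"))
val df2 = Seq((8, "USA"), (64, "UK"), (-27, "DE"))

// After withColumn("Country", lit("USA")), df1's Country is "USA" on
// every row, so the equi-join condition Country === Country no longer
// depends on df1 at all. The optimizer folds it into:
//   1. a filter on df2 keeping rows where Country = "USA"
val filtered = df2.filter { case (_, country) => country == "USA" }

//   2. a cartesian product of (rewritten) df1 with that filtered df2
val joined = for {
  (brand, _)    <- df1       // original Country discarded by the literal
  (ts, country) <- filtered  // only the "USA" row survives the filter
} yield (brand, "USA", ts)

// Every df1 row pairs with the single matching df2 row: 3 x 1 = 3 rows.
println(joined.size)
```

Since the join condition carries no information (it is the same constant on both sides), Spark flags the plan as an implicit cartesian product, which is why crossJoin (stating the intent explicitly) makes the query run.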