Spark 2.4.0 gives "Detected implicit cartesian product" exception for a left join with an empty right DF

Time: 2019-05-27 17:13:19

Tags: apache-spark-sql

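It appears that between Spark 2.2.1 and Spark 2.4.0, the behavior of a left join with an empty right dataframe changed from succeeding to throwing "AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans".

For example: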

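// spark-shell already has these imports in scope; a standalone program needs them.
import org.apache.spark.sql.functions.lit
import spark.implicits._

val emptyDf = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
val nonemptyDf = ((1L, "a") :: Nil).toDF("id", "size")
val neje = nonemptyDf.join(emptyDf, Seq("id"), "left")
neje.show()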

In 2.2.1, the result is

+---+----+-----+
| id|size|brand|
+---+----+-----+
|  1|   a| null|
+---+----+-----+

However, in 2.4.0, the following exception is thrown:

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Here is the full explain output for the latter:

> neje.explain(true)

== Parsed Logical Plan ==
'Join UsingJoin(LeftOuter,List(id))
:- Project [_1#275L AS id#278L, _2#276 AS size#279]
:  +- LocalRelation [_1#275L, _2#276]
+- Project [id#53L,  AS brand#55]
   +- Project [0 AS id#53L]
      +- LogicalRDD false

== Analyzed Logical Plan ==
id: bigint, size: string, brand: string
Project [id#278L, size#279, brand#55]
+- Join LeftOuter, (id#278L = id#53L)
   :- Project [_1#275L AS id#278L, _2#276 AS size#279]
   :  +- LocalRelation [_1#275L, _2#276]
   +- Project [id#53L,  AS brand#55]
      +- Project [0 AS id#53L]
         +- LogicalRDD false

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for LEFT OUTER join between logical plans
LocalRelation [id#278L, size#279]
and
Project [ AS brand#55]
+- LogicalRDD false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;

Additional observations:

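  • If only the left dataframe is empty, the join succeeds.
  • The analogous change in behavior holds for a right join with an empty left dataframe.
  • Interestingly, however, both versions fail with an AnalysisException for an inner join if both dataframes are empty.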
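
The observations above can be checked on a given Spark version with a sketch along these lines. It is not from the original post: the helper `attempt`, the names `emptyLeft`, `emptyRight`, `nonemptyRight`, and the local session settings are all assumptions made for illustration. The empty dataframes are built the same way as `emptyDf` in the question. No expected outcomes are listed because they depend on the Spark version.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("empty-join-check") // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Zero-row dataframes built from literal columns, as in the question.
val emptyRight = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("brand", lit(""))
val emptyLeft = spark.emptyDataFrame
  .withColumn("id", lit(0L))
  .withColumn("size", lit(""))

val nonemptyLeft = ((1L, "a") :: Nil).toDF("id", "size")
val nonemptyRight = ((1L, "x") :: Nil).toDF("id", "brand")

// Hypothetical helper: run the join and describe whether it succeeded or threw.
def attempt(left: DataFrame, right: DataFrame, joinType: String): String =
  Try(left.join(right, Seq("id"), joinType).collect())
    .map(rows => s"ok, ${rows.length} row(s)")
    .recover { case e => s"failed with ${e.getClass.getSimpleName}" }
    .get

println("left join, empty right:  " + attempt(nonemptyLeft, emptyRight, "left"))
println("left join, empty left:   " + attempt(emptyLeft, nonemptyRight, "left"))
println("right join, empty left:  " + attempt(emptyLeft, nonemptyRight, "right"))
println("inner join, both empty:  " + attempt(emptyLeft, emptyRight, "inner"))
```

Running the same script against 2.2.1 and 2.4.0 should make the differences described in the list directly comparable.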

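Is this a regression or by design? The earlier behavior seems more correct to me. I have not been able to find any relevant information in the Spark release notes, Spark Jira issues, or Stack Overflow questions.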

2 Answers:

Answer 0: (score: 0)

I did not run into exactly your problem, but I did get the same error, and I worked around it by explicitly allowing cross joins:

spark.conf.set( "spark.sql.crossJoin.enabled" , "true" )
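
For context, here is a minimal sketch of two places where this flag can be set. The local master and the application name are assumptions for illustration only. The configuration key is the one named in the exception message.

```scala
import org.apache.spark.sql.SparkSession

// Option 1 (assumption: you control session creation): set the flag while building the session.
val spark = SparkSession.builder()
  .master("local[*]")          // assumption: local mode for a quick experiment
  .appName("cross-join-flag")  // hypothetical application name
  .config("spark.sql.crossJoin.enabled", "true")
  .getOrCreate()

// Option 2: set it at runtime on an existing session, as shown in the answer.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// Read the value back to confirm the session picked it up.
println(spark.conf.get("spark.sql.crossJoin.enabled"))
```

Note that the setting applies to the whole session, so it also stops Spark from flagging joins that are cartesian products by accident.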

Answer 1: (score: 0)

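I have run into this problem several times. In the most recent case I remember, it happened because I was using one dataframe in multiple actions, so it was recomputed each time. Once I cached it at the source, the error went away.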
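
The answer does not include code, so the following is only a sketch of what "caching at the source" might look like. The dataframe `source`, its contents, and the session settings are hypothetical. Whether caching avoids the exception in a particular job is not something this sketch demonstrates.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation (spark-shell already provides `spark`).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-source") // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Hypothetical source data standing in for the dataframe reused across actions.
// cache() marks it for reuse; nothing is computed until the first action runs.
val source = Seq((1L, "a"), (2L, "b")).toDF("id", "size").cache()

// The first action materializes the cache; later actions read from it
// instead of recomputing the dataframe from scratch.
val total = source.count()
val ids = source.select("id").as[Long].collect()

println(s"rows: $total, ids: ${ids.sorted.mkString(", ")}")

// Release the cached data once it is no longer needed.
source.unpersist()
```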