Join and withColumn exception in Spark

Asked: 2019-01-21 18:41:00

Tags: apache-spark join pyspark apache-spark-sql

I am trying to join the following two DataFrames:

val df1 = Seq(
  ("Verizon", "USA"),
  ("AT & T", "PK"),
  ("Verizon", "IND")
).toDF("Brand", "Country")

val df2 = Seq(
  (8, "USA"),
  (64, "UK"),
  (-27, "DE")
).toDF("TS", "Country")

If I join them like this, it works:

df1.join(df2, Seq("Country")).count

But when I use withColumn() with lit() before the join (to replace the column values), it throws an exception:

df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count

Exception:

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LocalRelation
and
Project
+- Filter (isnotnull(_2#680) && (USA = _2#680))
   +- LocalRelation [_1#679, _2#680]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
   ...

It also works when I use crossJoin:

df1.withColumn("Country", lit("USA")).crossJoin(df2.filter(col("Country") === "USA"))

But I don't understand why it doesn't work with a plain join, or why a cross join is needed to make it work. Any help would be appreciated. Thanks.

1 Answer:

Answer 0 (score: 0)

The Spark analyzer detects a cross-join condition even though you intended an inner join.

Because cross joins are expensive, the default behavior is to throw an exception whenever the planner detects a cartesian product that the query did not explicitly request.

This happens because the join column was replaced with a literal: after constant folding, the equality condition on Country becomes trivial, so no real join key remains and the join degenerates into a cartesian product.
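You can see this in the logical plan from the exception: the condition `USA = Country` is pushed down as a filter on df2, and the remaining join has no key. A hypothetical sketch of what the optimized query is roughly equivalent to (this is a reconstruction for illustration, not the author's code):

```scala
// Sketch only: roughly what the optimizer reduces the original query to.
// The literal join key folds to a constant, so the predicate becomes a
// filter on df2 and the join itself is left with no condition,
// i.e. a cartesian product.
df1.drop("Country")
  .crossJoin(df2.filter(col("Country") === "USA"))
```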

Cross join behavior explanation is covered in more detail in the thread mentioned by user10465355.
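If joining on a constant key really is the intent, the error message itself points to the workarounds. A minimal sketch, assuming an active SparkSession named `spark` and the imports `org.apache.spark.sql.functions.{col, lit}`:

```scala
// Option 1: allow implicit cartesian products globally (use with care;
// this disables the safety check for the whole session).
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count

// Option 2: state the intent explicitly with crossJoin, as in the question,
// filtering df2 down to the matching rows first.
df1.withColumn("Country", lit("USA"))
  .crossJoin(df2.filter(col("Country") === "USA").drop("Country"))
  .count
```

Option 2 is usually preferable, since it keeps the guard against accidental cartesian products in place for the rest of the application.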