Apache Spark - Why does the union order of withColumn DataFrames produce different join results?

Asked: 2017-04-21 06:58:32

Tags: apache-spark apache-spark-sql spark-dataframe

Environment:

  • OS: Windows 7
  • Spark: version 2.1.0
  • Scala: 2.11.8
  • Java: 1.8

Spark Shell REPL

scala> val info = Seq((10, "A"), (100, "B")).toDF("id", "type")
info: org.apache.spark.sql.DataFrame = [id: int, type: string]

scala> val statC = Seq((1)).toDF("sid").withColumn("stype", lit("A"))
statC: org.apache.spark.sql.DataFrame = [sid: int, stype: string]

scala> val statD  = Seq((2)).toDF("sid").withColumn("stype", lit("B"))
statD: org.apache.spark.sql.DataFrame = [sid: int, stype: string]

scala> info.join(statC.union(statD), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10|   A|  1|    A|
+---+----+---+-----+

scala> info.join(statD.union(statC), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
+---+----+---+-----+
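
For completeness, here are the same steps as a self-contained app rather than a shell session. The spark-shell imports spark.implicits._ and org.apache.spark.sql.functions._ automatically, so they are written out here; the object name is just for illustration, and I have only tried this on Spark 2.1.0:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Object name is illustrative only.
object UnionOrderJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("union-order-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val info  = Seq((10, "A"), (100, "B")).toDF("id", "type")
    val statC = Seq(1).toDF("sid").withColumn("stype", lit("A"))
    val statD = Seq(2).toDF("sid").withColumn("stype", lit("B"))

    // The join condition is identical; only the union order differs.
    val cond = $"id" / 10 === $"sid" and $"type" === $"stype"
    info.join(statC.union(statD), cond).show()  // one row: (10, A, 1, A)
    info.join(statD.union(statC), cond).show()  // empty on Spark 2.1.0

    spark.stop()
  }
}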

The stype column of statC and statD is generated with withColumn(lit(...)); the REPL session above shows that statC.union(statD) and statD.union(statC) produce different join results.
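
As a quick sanity check in the same shell session, the two unions themselves contain the same two rows regardless of order, so the data fed into the join is identical; only the join result differs:

// Both unions hold the rows (1, A) and (2, B); only their order differs.
statC.union(statD).show()
statD.union(statC).show()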

I ran explain on both joins to look at their physical plans:

scala> info.join(statC.union(statD), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0)], [cast(sid#420 as double)], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
:  +- *Filter ((isnotnull(_2#340) && ((A <=> _2#340) || (B <=> _2#340))) && (_2#340 = A))
:     +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double)))
   +- Union
      :- LocalTableScan [sid#420, stype#423]
      +- LocalTableScan [sid#430, stype#433]

scala> info.join(statD.union(statC), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0)], [cast(sid#430 as double)], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
:  +- *Filter ((isnotnull(_2#340) && ((B <=> _2#340) || (A <=> _2#340))) && (_2#340 = B))
:     +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double)))
   +- Union
      :- LocalTableScan [sid#430, stype#433]
      +- LocalTableScan [sid#420, stype#423]

The explain output shows that the union order of statC and statD changes the Filter condition on the info side of the BroadcastHashJoin plan:

With statC.union(statD), the filter condition is:

Filter ((isnotnull(_2#340) && ((A <=> _2#340) || (B <=> _2#340))) && (_2#340 = A))

With statD.union(statC), the filter condition is:

Filter ((isnotnull(_2#340) && ((B <=> _2#340) || (A <=> _2#340))) && (_2#340 = B))
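
To dig a bit deeper than the physical plan, the optimized logical plans of the two orderings can be compared directly (same shell session). I would expect the extra (_2#340 = A) / (_2#340 = B) predicate to show up there as well, which would point at an optimizer rule rather than at physical planning:

// Build both joins and print their optimized logical plans side by side.
val joinCD = info.join(statC.union(statD), $"id" / 10 === $"sid" and $"type" === $"stype")
val joinDC = info.join(statD.union(statC), $"id" / 10 === $"sid" and $"type" === $"stype")

println(joinCD.queryExecution.optimizedPlan.numberedTreeString)
println(joinDC.queryExecution.optimizedPlan.numberedTreeString)

// explain(true) also prints the parsed, analyzed, optimized and physical plans in one go.
joinCD.explain(true)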

However, when the two unioned DataFrames are built without withColumn, the union order has no effect on the join result.

scala> val info = Seq((10, "A"), (100, "B")).toDF("id", "type")
info: org.apache.spark.sql.DataFrame = [id: int, type: string]

scala> val statA = Seq((1, "A")).toDF("sid", "stype")
statA: org.apache.spark.sql.DataFrame = [sid: int, stype: string]

scala> val statB = Seq((2, "B")).toDF("sid", "stype")
statB: org.apache.spark.sql.DataFrame = [sid: int, stype: string]

scala> info.join(statA.union(statB), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10|   A|  1|    A|
+---+----+---+-----+

scala> info.join(statB.union(statA), $"id"/10 === $"sid" and $"type"===$"stype").show
+---+----+---+-----+
| id|type|sid|stype|
+---+----+---+-----+
| 10|   A|  1|    A|
+---+----+---+-----+

scala> info.join(statA.union(statB), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0), type#343], [cast(sid#352 as double), stype#353], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
:  +- *Filter isnotnull(_2#340)
:     +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double), input[1, string, true]))
   +- Union
      :- *Project [_1#349 AS sid#352, _2#350 AS stype#353]
      :  +- *Filter isnotnull(_2#350)
      :     +- LocalTableScan [_1#349, _2#350]
      +- *Project [_1#359 AS sid#362, _2#360 AS stype#363]
         +- *Filter isnotnull(_2#360)
            +- LocalTableScan [_1#359, _2#360]

scala> info.join(statB.union(statA), $"id"/10 === $"sid" and $"type"===$"stype").explain
== Physical Plan ==
*BroadcastHashJoin [(cast(id#342 as double) / 10.0), type#343], [cast(sid#362 as double), stype#363], Inner, BuildRight
:- *Project [_1#339 AS id#342, _2#340 AS type#343]
:  +- *Filter isnotnull(_2#340)
:     +- LocalTableScan [_1#339, _2#340]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as double), input[1, string, true]))
   +- Union
      :- *Project [_1#359 AS sid#362, _2#360 AS stype#363]
      :  +- *Filter isnotnull(_2#360)
      :     +- LocalTableScan [_1#359, _2#360]
      +- *Project [_1#349 AS sid#352, _2#350 AS stype#353]
         +- *Filter isnotnull(_2#350)
            +- LocalTableScan [_1#349, _2#350]

The explain output shows that in the statA / statB case both id/type and sid/stype appear as BroadcastHashJoin keys, while in the statC / statD case only id and sid do.
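
My guess (not a confirmed explanation) is that this comes from stype being a constant literal in statC / statD: withColumn("stype", lit("A")) puts A AS stype into the plan rather than a data column, whereas statA / statB read stype from the data. The analyzed plans show the difference:

// statC carries stype as the literal A (the analyzed plan shows `A AS stype#...`),
// while statA's stype is an ordinary column coming from the input data. I suspect
// this is what lets the optimizer drop stype from the join keys, but I don't know
// which rule is responsible.
statC.explain(true)
statA.explain(true)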

Why does changing the union order give the join different semantics when the DataFrames are created with withColumn?

0 Answers:

There are no answers.