Predicate pushdown does not work for full outer joins in Spark DataFrames

Time: 2019-07-03 20:56:11

Tags: apache-spark apache-spark-sql

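Predicate pushdown does not seem to happen for a full outer join in a Spark DataFrame.

When the join type is inner, predicate pushdown appears to work: the filter shows up below the join in the physical plan. When the join type is full outer, the predicate is not pushed down and the filter stays above the join.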

scala> val left = Seq((0, "a"), (1, "b"), (2, "c")).toDF("id", "val")
2019-07-03 13:46:40 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
left: org.apache.spark.sql.DataFrame = [id: int, val: string]

scala> val right = Seq ((2, "c"), (3, "d")).toDF("id", "val_2")
right: org.apache.spark.sql.DataFrame = [id: int, val_2: string]

scala> val df = left.join(right, Seq("id"), "fullouter")
df: org.apache.spark.sql.DataFrame = [id: int, val: string ... 1 more field]

scala> df.show
+---+----+-----+
| id| val|val_2|
+---+----+-----+
|  1|   b| null|
|  3|null|    d|
|  2|   c|    c|
|  0|   a| null|
+---+----+-----+


scala> val df = left.join(right, Seq("id"), "fullouter").where($"id" === 1)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, val: string ... 1 more field]

scala> df.explain
== Physical Plan ==
*(3) Project [coalesce(id#5, id#14) AS id#33, val#6, val_2#15]
+- *(3) Filter (coalesce(id#5, id#14) = 1)
   +- SortMergeJoin [id#5], [id#14], FullOuter
      :- *(1) Sort [id#5 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(id#5, 200)
      :     +- LocalTableScan [id#5, val#6]
      +- *(2) Sort [id#14 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(id#14, 200)
            +- LocalTableScan [id#14, val_2#15]

scala> val df = left.join(right, Seq("id"), "inner").where($"id" === 1)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, val: string ... 1 more field]

scala> df.explain
== Physical Plan ==
*(2) Project [id#5, val#6, val_2#15]
+- *(2) BroadcastHashJoin [id#5], [id#14], Inner, BuildRight
   :- *(2) Project [_1#2 AS id#5, _2#3 AS val#6]
   :  +- *(2) Filter (_1#2 = 1)
   :     +- LocalTableScan [_1#2, _2#3]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
      +- *(1) Project [_1#11 AS id#14, _2#12 AS val_2#15]
         +- *(1) Filter (_1#11 = 1)
            +- LocalTableScan [_1#11, _2#12]

1 answer:

Answer 0 (score: 0)

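We can push a predicate down through an inner join because the result is the same either way. With a full outer join, however, pushing the predicate down can give a different result from applying it after the join: a row removed from one input before the join can no longer match its partner on the other side, so that partner is emitted padded with nulls instead of being filtered out. That is why the predicate is not pushed down in a full outer join. You will find the same thing in MySQL or PostgreSQL.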
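To make the "different result" concrete, here is a minimal sketch that reuses the `left` and `right` DataFrames from the question. The filter on `val` is chosen for illustration and is not the one in the question (which filters on the join key `id`). The names `filteredAfter`, `filteredBefore`, `idsAfter`, and `idsBefore` are hypothetical. The local `SparkSession` setup is an assumption for running this as a standalone script.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration. In spark-shell, skip these two
// statements, since `spark` and its implicits are already available there.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("full-outer-pushdown-sketch")
  .getOrCreate()
import spark.implicits._

// Same data as in the question.
val left = Seq((0, "a"), (1, "b"), (2, "c")).toDF("id", "val")
val right = Seq((2, "c"), (3, "d")).toDF("id", "val_2")

// Filter applied after the full outer join: the null-padded row for id 3
// has val = null, so the predicate drops it. Only id 2 remains.
val filteredAfter = left.join(right, Seq("id"), "fullouter").where($"val" === "c")

// The same predicate pushed by hand below the join, onto the left input.
// The right-side row with id 3 has no partner and is emitted with val = null,
// so the result contains ids 2 and 3.
val filteredBefore = left.where($"val" === "c").join(right, Seq("id"), "fullouter")

// Compare as sets, because row order after a join is not guaranteed.
val idsAfter = filteredAfter.select("id").as[Int].collect().toSet
val idsBefore = filteredBefore.select("id").as[Int].collect().toSet

println(s"filter after join:  ids = $idsAfter")
println(s"filter before join: ids = $idsBefore")
```

The two queries return different row sets, so an optimizer cannot treat them as interchangeable in general. With `"inner"` in place of `"fullouter"`, both variants return only the row with id 2.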