广播哈希连接不适用于Spark SQL中的完全连接吗?

时间:2019-03-03 10:21:08

标签: apache-spark

我有两个小表,它们按如下所示进行完全外部联接,我认为它应该使用广播联接,但是它选择了“排序合并联接”,我想知道为什么。

  test("SparkTest 0461") {

    val spark = SparkSession.builder().master("local").appName("SparkTest0460").getOrCreate()
    import spark.implicits._
    val data1 = Seq((1, 2), (1, 7), (3, 6), (5, 4), (1, 10), (6, 7), (2, 5))
    val data2 = Seq(9, 4, 2, 7, 6, 8)
    val x = 10L * 1024*1024
    spark.sql(s"set spark.sql.autoBroadcastJoinThreshold=$x")
    spark.createDataset(data1).toDF("a", "b").createOrReplaceTempView("x")
    spark.createDataset(data2).toDF("c").createOrReplaceTempView("y")
    val df = spark.sql(
      """
         select * from x full join y on a = c
      """.stripMargin(' '))
      df.explain(true)
  }

物理计划如下,表明它正在使用SMJ

== Physical Plan ==
SortMergeJoinExec [a#11], [c#19], FullOuter
:- *(1) SortExec [a#11 ASC NULLS FIRST], false, 0
:  +- ShuffleExchangeExec hashpartitioning(a#11, 200)
:     +- LocalTableScanExec [a#11, b#12]
+- *(2) SortExec [c#19 ASC NULLS FIRST], false, 0
   +- ShuffleExchangeExec hashpartitioning(c#19, 200)
      +- LocalTableScanExec [c#19]

1 个答案:

答案 0 :(得分:1)

full outer join不支持BroadcastHashJoin。查看this链接以了解详细信息。

如果用任何受支持的联接替换full outer join,则物理计划将显示它选择了BroadcastHashJoin。

例如,

val dfOuter = spark.sql(""" select * from x outer join y on a = c """.stripMargin(' '))
dfOuter.explain(true)

给予

== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, ('a = 'c)
   :- 'SubqueryAlias outer
   :  +- 'UnresolvedRelation `x`
   +- 'UnresolvedRelation `y`

== Analyzed Logical Plan ==
a: int, b: int, c: int
Project [a#75, b#76, c#82]
+- Join Inner, (a#75 = c#82)
   :- SubqueryAlias outer
   :  +- SubqueryAlias x
   :     +- Project [_1#72 AS a#75, _2#73 AS b#76]
   :        +- LocalRelation [_1#72, _2#73]
   +- SubqueryAlias y
      +- Project [value#80 AS c#82]
         +- LocalRelation [value#80]

== Optimized Logical Plan ==
Join Inner, (a#75 = c#82)
:- LocalRelation [a#75, b#76]
+- LocalRelation [c#82]

== Physical Plan ==
*(1) BroadcastHashJoin [a#75], [c#82], Inner, BuildRight
:- LocalTableScan [a#75, b#76]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
   +- LocalTableScan [c#82]