I have two small tables that are full outer joined as shown below. I expected a broadcast join to be used, but Spark chose a sort-merge join instead, and I would like to know why.
test("SparkTest 0461") {
  val spark = SparkSession.builder().master("local").appName("SparkTest0460").getOrCreate()
  import spark.implicits._

  val data1 = Seq((1, 2), (1, 7), (3, 6), (5, 4), (1, 10), (6, 7), (2, 5))
  val data2 = Seq(9, 4, 2, 7, 6, 8)

  // 10 MB broadcast threshold -- both tables are far smaller than this
  val x = 10L * 1024 * 1024
  spark.sql(s"set spark.sql.autoBroadcastJoinThreshold=$x")

  spark.createDataset(data1).toDF("a", "b").createOrReplaceTempView("x")
  spark.createDataset(data2).toDF("c").createOrReplaceTempView("y")

  val df = spark.sql(
    """
      select * from x full join y on a = c
    """.stripMargin(' '))
  df.explain(true)
}
The physical plan below shows that it is using SMJ:
== Physical Plan ==
SortMergeJoinExec [a#11], [c#19], FullOuter
:- *(1) SortExec [a#11 ASC NULLS FIRST], false, 0
: +- ShuffleExchangeExec hashpartitioning(a#11, 200)
: +- LocalTableScanExec [a#11, b#12]
+- *(2) SortExec [c#19 ASC NULLS FIRST], false, 0
+- ShuffleExchangeExec hashpartitioning(c#19, 200)
+- LocalTableScanExec [c#19]
Answer (score: 1)
BroadcastHashJoin does not support full outer join; see this link for details. If you replace the full outer join with any supported join type, the physical plan will show that BroadcastHashJoin is selected.
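This selection rule can be sketched in plain Scala (a hypothetical simplification of the behavior the answer describes, not Spark's actual `JoinSelection` code): broadcast hash join needs a side that both fits under `spark.sql.autoBroadcastJoinThreshold` and is a legal build side for the join type, and a full outer join has no legal build side.

```scala
// Simplified model of broadcast-hash-join eligibility (illustration only,
// not Spark's real planner code).
sealed trait JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType

object BroadcastModel {
  // The hash table can be built on the right side for inner/left-outer
  // joins, on the left side for inner/right-outer joins, and on neither
  // side for a full outer join.
  def canBuildRight(jt: JoinType): Boolean = jt match {
    case Inner | LeftOuter => true
    case _                 => false
  }
  def canBuildLeft(jt: JoinType): Boolean = jt match {
    case Inner | RightOuter => true
    case _                  => false
  }

  // A side is broadcastable when its estimated size is at or under the
  // autoBroadcastJoinThreshold value (passed in directly here).
  def chooseBroadcast(jt: JoinType, leftBytes: Long, rightBytes: Long,
                      threshold: Long): Boolean =
    (canBuildRight(jt) && rightBytes <= threshold) ||
    (canBuildLeft(jt) && leftBytes <= threshold)
}

// Both tables fit under the question's 10 MB threshold, yet a full outer
// join still cannot broadcast, so the planner falls back to sort-merge join.
val threshold = 10L * 1024 * 1024
println(BroadcastModel.chooseBroadcast(FullOuter, 100, 100, threshold)) // false
println(BroadcastModel.chooseBroadcast(Inner, 100, 100, threshold))     // true
```

Under this model, table sizes are irrelevant for a full outer join: the join type alone rules broadcast out.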
For example,
val dfOuter = spark.sql(""" select * from x outer join y on a = c """.stripMargin(' '))
dfOuter.explain(true)
gives the following (note that `outer` here is parsed as a table alias for `x`, so this query is effectively an inner join, as the `SubqueryAlias outer` and `Join Inner` nodes in the plan show):
== Parsed Logical Plan ==
'Project [*]
+- 'Join Inner, ('a = 'c)
:- 'SubqueryAlias outer
: +- 'UnresolvedRelation `x`
+- 'UnresolvedRelation `y`
== Analyzed Logical Plan ==
a: int, b: int, c: int
Project [a#75, b#76, c#82]
+- Join Inner, (a#75 = c#82)
:- SubqueryAlias outer
: +- SubqueryAlias x
: +- Project [_1#72 AS a#75, _2#73 AS b#76]
: +- LocalRelation [_1#72, _2#73]
+- SubqueryAlias y
+- Project [value#80 AS c#82]
+- LocalRelation [value#80]
== Optimized Logical Plan ==
Join Inner, (a#75 = c#82)
:- LocalRelation [a#75, b#76]
+- LocalRelation [c#82]
== Physical Plan ==
*(1) BroadcastHashJoin [a#75], [c#82], Inner, BuildRight
:- LocalTableScan [a#75, b#76]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- LocalTableScan [c#82]
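To see why a one-sided broadcast cannot implement a full outer join, consider a toy hash join in plain Scala (an illustration, not Spark code): each task holds a full copy of the broadcast side, so it can safely emit unmatched rows from its own streamed partition, but if every task also emitted the broadcast side's unmatched rows, those rows would appear once per partition instead of once overall.

```scala
// Toy broadcast hash join: `partitions` is the streamed side split across
// tasks, `broadcast` is the small side copied to every task.
def broadcastLeftOuter(
    partitions: Seq[Seq[(Int, String)]],
    broadcast: Seq[(Int, String)]): Seq[(Int, String, Option[String])] = {
  val buildTable = broadcast.groupBy(_._1)  // hash table, built per task
  partitions.flatMap { part =>              // one iteration per "task"
    part.flatMap { case (k, v) =>
      buildTable.get(k) match {
        case Some(ms) => ms.map { case (_, bv) => (k, v, Some(bv)) }
        case None     => Seq((k, v, None))  // stream-side unmatched: safe
      }
    }
  }
}

val stream = Seq(Seq((1, "a"), (2, "b")), Seq((3, "c")))  // two "tasks"
val small  = Seq((1, "x"), (9, "z"))

// Left outer works: matched and stream-unmatched rows each appear once.
println(broadcastLeftOuter(stream, small))

// A full outer join would also need the row for key 9 (unmatched on the
// broadcast side) exactly once, but both tasks see the whole broadcast
// copy, so emitting broadcast-side unmatched rows per task would produce
// it twice. Hence no BroadcastHashJoin for full outer; Spark picks
// SortMergeJoin (or a nested-loop plan) instead.
```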