Question

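I would like to know whether the spark.sql.autoBroadcastJoinThreshold property can be used to broadcast the smaller table to all worker nodes (while performing the join) even when the join is written with the Dataset API rather than Spark SQL.

If my bigger table is 250 GB and the smaller one is 20 GB, do I need to set spark.sql.autoBroadcastJoinThreshold = 21 GB (perhaps) so that the whole table/Dataset is sent to all worker nodes?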

Examples:

Dataset API join

val result = rawBigger.as("b").join(
  broadcast(smaller).as("s"),
  rawBigger(FieldNames.CAMPAIGN_ID) === smaller(FieldNames.CAMPAIGN_ID), 
  "left_outer"
)

SQL

select * 
from rawBigger_table b, smaller_table s
where b.campaign_id = s.campaign_id;

Answer 1

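First of all, spark.sql.autoBroadcastJoinThreshold and the broadcast hint are separate mechanisms. Even if autoBroadcastJoinThreshold is disabled, an explicit broadcast hint takes precedence. With the default settings: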

spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

String = 10485760

val df1 = spark.range(100)
val df2 = spark.range(100)

Spark will use autoBroadcastJoinThreshold and broadcast the data automatically:

df1.join(df2, Seq("id")).explain

== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
   :- *Range (0, 100, step=1, splits=Some(8))
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range (0, 100, step=1, splits=Some(8))

When we disable auto broadcast, Spark falls back to the standard SortMergeJoin:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
df1.join(df2, Seq("id")).explain

== Physical Plan ==
*Project [id#0L]
+- *SortMergeJoin [id#0L], [id#3L], Inner
   :- *Sort [id#0L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(id#0L, 200)
   :     +- *Range (0, 100, step=1, splits=Some(8))
   +- *Sort [id#3L ASC NULLS FIRST], false, 0
      +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200)

But it can be forced to use BroadcastHashJoin with the broadcast hint:

import org.apache.spark.sql.functions.broadcast
df1.join(broadcast(df2), Seq("id")).explain

== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
   :- *Range (0, 100, step=1, splits=Some(8))
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range (0, 100, step=1, splits=Some(8))

SQL has its own hint format (similar to the one used in Hive):

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

spark.sql(
 "SELECT  /*+ MAPJOIN(df2) */ * FROM df1 JOIN df2 ON df1.id = df2.id"
).explain

== Physical Plan ==
*BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
:- *Range (0, 100, step=1, splits=8)
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   +- *Range (0, 100, step=1, splits=8)
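
For comparison, here is a minimal sketch of two other ways to express the same explicit hint, assuming Spark 2.2 or later: the hint method on a Dataset and the BROADCAST hint name in SQL. The local session, the application name, and the object name BroadcastHintSketch are illustrative assumptions, not part of the original answer. With auto broadcast disabled, both plans should still show a BroadcastHashJoin.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastHintSketch {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration (assumption, not from the answer).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-sketch")
      .getOrCreate()

    // Disable size-based auto broadcast so that only the explicit hints matter.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

    val df1 = spark.range(100)
    val df2 = spark.range(100)

    // Dataset API: attach the hint to the side that should be broadcast.
    df1.join(df2.hint("broadcast"), Seq("id")).explain()

    // SQL: the BROADCAST hint name is accepted alongside MAPJOIN.
    df1.createOrReplaceTempView("df1")
    df2.createOrReplaceTempView("df2")
    spark.sql(
      "SELECT /*+ BROADCAST(df2) */ * FROM df1 JOIN df2 ON df1.id = df2.id"
    ).explain()

    spark.stop()
  }
}
```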

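So, to answer your question: autoBroadcastJoinThreshold does apply when using the Dataset API, but it is not relevant when an explicit broadcast hint is used.

Furthermore, broadcasting large objects is unlikely to provide any performance boost, and in practice it will often degrade performance and cause stability issues. Remember that a broadcast object has to be first fetched to the driver, then sent to each worker, and finally loaded into memory.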
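
Before forcing a broadcast, it may help to look at the size the optimizer itself estimates for the smaller side. The sketch below assumes Spark 2.3 or later, where the logical plan's stats method takes no arguments; the helper estimatedSizeInBytes and the object name are hypothetical, and the number is only an estimate, not the real in-memory footprint of the broadcast relation.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object SizeEstimateSketch {
  // Hypothetical helper: the optimizer's size estimate for a Dataset, in bytes.
  // Assumes Spark 2.3+, where LogicalPlan.stats has no parameters.
  def estimatedSizeInBytes(ds: Dataset[_]): BigInt =
    ds.queryExecution.optimizedPlan.stats.sizeInBytes

  def main(args: Array[String]): Unit = {
    // Local session purely for illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("size-estimate-sketch")
      .getOrCreate()

    val smaller = spark.range(100)
    // The threshold is printed as returned; its string format varies by version.
    val threshold = spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

    println(s"estimated size: ${estimatedSizeInBytes(smaller)} bytes")
    println(s"autoBroadcastJoinThreshold: $threshold")

    spark.stop()
  }
}
```

An estimate far above the threshold (as with a 20 GB table) is a sign that a sort-merge join is probably the safer choice.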

Answer 2

Just to share a few more details (from the code) on top of the great answer from @user6910411.

Quoting the source code (formatting mine):

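spark.sql.autoBroadcastJoinThreshold Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join.

By setting this value to -1, broadcasting can be disabled.

Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE COMPUTE STATISTICS noscan has been run, and file-based data source tables where the statistics are computed directly on the files of data.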

spark.sql.autoBroadcastJoinThreshold defaults to 10M (i.e. 10L * 1024 * 1024), and Spark checks which join to use (see the JoinSelection execution planning strategy).
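
As a worked example of the arithmetic, the sketch below computes the default threshold and the byte value that "21 GB" from the question would correspond to, treating 1 GB as 1024^3 bytes. The helper gib and the object name are hypothetical, not Spark APIs. The L suffix matters: with plain Int arithmetic, 21 * 1024 * 1024 * 1024 would overflow.

```scala
object ThresholdBytes {
  // Spark's default as quoted above: 10L * 1024 * 1024.
  val defaultThreshold: Long = 10L * 1024 * 1024

  // Hypothetical helper (not a Spark API): gibibytes to bytes, in Long arithmetic.
  def gib(n: Long): Long = n * 1024L * 1024L * 1024L

  def main(args: Array[String]): Unit = {
    println(defaultThreshold) // 10485760
    println(gib(21))          // 22548578304
  }
}
```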

There are 6 different join selections, and among them is broadcasting (using the BroadcastHashJoinExec or BroadcastNestedLoopJoinExec physical operators).

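BroadcastHashJoinExec will get chosen when there are joining keys and one of the following holds:

Join is one of CROSS, INNER, LEFT ANTI, LEFT OUTER, LEFT SEMI and the right join side can be broadcast, i.e. its size is less than spark.sql.autoBroadcastJoinThreshold
Join is one of CROSS, INNER and RIGHT OUTER and the left join side can be broadcast, i.e. its size is less than spark.sql.autoBroadcastJoinThreshold

BroadcastNestedLoopJoinExec will get chosen when there are no joining keys and one of the above conditions of BroadcastHashJoinExec holds.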
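
The selection rules described above can be sketched as a small pure-Scala model. This is a simplified illustration of the conditions as worded in this answer, not Spark's JoinSelection code: every name in it (JoinSelectionSketch, canBroadcast, pickBroadcastOperator) is hypothetical, explicit broadcast hints are not modelled, and Spark's own size comparison may differ at the exact boundary.

```scala
object JoinSelectionSketch {
  sealed trait JoinType
  case object Cross extends JoinType
  case object Inner extends JoinType
  case object LeftAnti extends JoinType
  case object LeftOuter extends JoinType
  case object LeftSemi extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  // A side can be broadcast when its estimated size is below the threshold.
  // A threshold of -1 therefore disables broadcasting entirely.
  def canBroadcast(sizeInBytes: Long, threshold: Long): Boolean =
    sizeInBytes >= 0 && sizeInBytes < threshold

  // Join types for which the right side may be the broadcast side.
  def canBuildRight(jt: JoinType): Boolean = jt match {
    case Cross | Inner | LeftAnti | LeftOuter | LeftSemi => true
    case _ => false
  }

  // Join types for which the left side may be the broadcast side.
  def canBuildLeft(jt: JoinType): Boolean = jt match {
    case Cross | Inner | RightOuter => true
    case _ => false
  }

  // Returns the broadcast operator name, or None when neither side qualifies.
  def pickBroadcastOperator(
      jt: JoinType,
      hasJoinKeys: Boolean,
      leftSize: Long,
      rightSize: Long,
      threshold: Long): Option[String] = {
    val qualifies =
      (canBuildRight(jt) && canBroadcast(rightSize, threshold)) ||
      (canBuildLeft(jt) && canBroadcast(leftSize, threshold))
    if (!qualifies) None
    else if (hasJoinKeys) Some("BroadcastHashJoinExec")
    else Some("BroadcastNestedLoopJoinExec")
  }
}
```

With the question's numbers (a left outer join with keys, 250 GB on the left, 20 GB on the right), the model returns None at the default threshold and BroadcastHashJoinExec once the threshold exceeds 20 GB. It only mirrors the selection step and says nothing about whether a broadcast of that size would actually succeed.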

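In other words, Spark will automatically select the right join, including BroadcastHashJoinExec, based on the spark.sql.autoBroadcastJoinThreshold property (among other requirements) and also on the join type.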

Does spark.sql.autoBroadcastJoinThreshold work for joins using Dataset's join operator?
