I want to perform a broadcast join between two Hive tables. One has about 300-400 MB of data, the other about 1 MB. I want to broadcast the small table.
When I read the tables with spark.read.table("tableA"), the explain method shows a SortMergeJoin. However, when I read them with spark.read.parquet("tableALocation"), it shows a broadcast join.
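For context, Spark only auto-broadcasts a side whose estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default). A minimal sketch for inspecting and adjusting that threshold in spark-shell:

```scala
// Check the current auto-broadcast threshold (default: 10485760 bytes = 10 MB).
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

// Raise it, e.g. to 50 MB, if the small side is estimated above 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
```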
Performing the join with Hive tables (SortMergeJoin):
scala> val smallTable = spark.read.table("test.smallTable")
smallTable: org.apache.spark.sql.DataFrame = [x_col_a: double, x_col_b: double ... 49 more fields]
scala> val bigtable = spark.read.table("test.bigtable")
bigtable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]
scala> val joinTable = bigtable.join(smallTable,bigtable("quantile_") === smallTable("quantile_"),"inner")
joinTable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]
scala> joinTable.explain
== Physical Plan ==
SortMergeJoin [quantile_#7502], [quantile_#7397], Inner
:- Sort [quantile_#7502 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(quantile_#7502, 200)
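One way to see why the planner picked SortMergeJoin is to look at the size Spark estimates for each side. A sketch (the `stats` accessor below is parameterless in Spark 2.3+; in 2.2 it takes a conf argument):

```scala
// Inspect the optimizer's size estimate for the small table.
// If Hive table statistics are missing, Spark may fall back to a
// conservative default (spark.sql.defaultSizeInBytes), which can be
// far larger than the real ~1 MB of data on disk.
val smallTable = spark.read.table("test.smallTable")
println(smallTable.queryExecution.optimizedPlan.stats.sizeInBytes)
```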
However, if I read the parquet files directly, a broadcast join happens automatically.
val smallTableFile = spark.read.parquet("/apps/hive/warehouse/test.db/test_1_crosssellaggregatecomponent")
smallTableFile: org.apache.spark.sql.DataFrame = [x_vce_offnet_moc_drtn_secs: double, x_sms_onnet_moc_billed_rev: double ... 49 more fields]
scala> val bigTableFile = spark.read.parquet("/apps/hive/warehouse/test.db/test_1_prevcurrentlateststringinputjoincomponent")
bigTableFile: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]
scala> val join = bigTableFile.join(smallTableFile,smallTableFile("quantile_") === bigTableFile("quantile_"),"inner")
join: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]
scala> join.explain
== Physical Plan ==
BroadcastHashJoin [quantile_#8562], [quantile_#8508], Inner, BuildRight
I also observed that if I persist smallTable, the join automatically becomes a broadcast join.
scala> val smallTable = spark.read.table("test.smallTable")
smallTable: org.apache.spark.sql.DataFrame = [x_vce_offnet_moc_drtn_secs: double, x_sms_onnet_moc_billed_rev: double ... 49 more fields]
scala> val bigTable = spark.read.table("test.bigTable")
bigTable: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 109 more fields]
scala> val join = bigTable.join(smallTable,smallTable("quantile_") === bigTable("quantile_"),"inner")
join: org.apache.spark.sql.DataFrame = [customer: bigint, quantile_: int ... 160 more fields]
scala> join.explain
19/07/18 10:30:36 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
== Physical Plan ==
SortMergeJoin [quantile_#154], [quantile_#49], Inner
:- Sort [quantile_#154 ASC NULLS FIRST], false, 0
scala> smallTable.persist
res1: smallTable.type = [x_col_a: double, x_col_b: double ... 49 more fields]
scala> smallTable.count
res2: Long = 10
scala> join.explain
== Physical Plan ==
BroadcastHashJoin [quantile_#154], [quantile_#49], Inner, BuildRight
I know we can force a broadcast using sql.functions.broadcast. I want to know why the automatic broadcast does not happen when reading the Hive table, while it does when reading the parquet files directly.
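For completeness, forcing the broadcast with the hint mentioned above looks like this (a sketch reusing the DataFrames defined earlier):

```scala
import org.apache.spark.sql.functions.broadcast

// Explicitly mark the small side for broadcasting; this bypasses
// the optimizer's size-estimate check entirely.
val hintedJoin = bigTable.join(
  broadcast(smallTable),
  bigTable("quantile_") === smallTable("quantile_"),
  "inner")
hintedJoin.explain  // should now show BroadcastHashJoin
```

If the Hive table lacks statistics, running spark.sql("ANALYZE TABLE test.smallTable COMPUTE STATISTICS") may also let the automatic broadcast kick in, since it gives the optimizer a realistic size estimate.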