I have two tables, A and B, and I want the subset of A whose key k also appears in B.
One option is to use a join:
select A.*
from A
join B on A.k = B.k
Another is an exists subquery:
select A.*
from A
where exists (select *, B.k from B where A.k = B.k)
If the field k in B is unique, I feel they are the same. Does using the exists subquery really make a difference to Spark?
Answer 0 (score: 2)
The simplest and most reliable way to find out is to explain both queries and compare their physical plans.
scala> println(spark.version)
2.4.0
scala> sql("select A.* from A join B on A.k = B.k").explain
== Physical Plan ==
*(2) Project [k#10L]
+- *(2) BroadcastHashJoin [k#10L], [k#6L], Inner, BuildRight
   :- *(2) Project [id#8L AS k#10L]
   :  +- *(2) Range (0, 10, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Project [id#4L AS k#6L]
         +- *(1) Range (0, 10, step=1, splits=8)
scala> sql("""select * from a where exists (select *, B.k from B where A.k = B.k)""").explain
== Physical Plan ==
*(2) Project [id#8L AS k#10L]
+- *(2) BroadcastHashJoin [id#8L], [k#6L], LeftSemi, BuildRight
   :- *(2) Range (0, 10, step=1, splits=8)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *(1) Project [id#4L AS k#6L, id#4L AS k#6L]
         +- *(1) Range (0, 10, step=1, splits=8)
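They look similar, don't they? Both queries run as a BroadcastHashJoin over the same two inputs; the only visible difference is the join type, Inner for the join and LeftSemi for the exists subquery.
You wrote: "I feel they are the same."
They are, as the plans above show, given your assumption that k is unique in B.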
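To make the Inner versus LeftSemi distinction in the two plans concrete, here is a minimal sketch in plain Scala collections. It does not use Spark at all. The object `JoinSemantics`, the helpers `innerJoin` and `leftSemiJoin`, and the sample data are all made up for illustration; they only model what each join type returns for a single key column.

```scala
// Plain-Scala model of the two join types; no Spark required.
// All names and data here are hypothetical, for illustration only.
object JoinSemantics {
  // Inner join on the key: one output row per matching (A row, B row) pair.
  def innerJoin(a: Seq[Long], b: Seq[Long]): Seq[Long] =
    for (x <- a; y <- b if x == y) yield x

  // Left semi join on the key: keep an A row once if any B row matches it.
  def leftSemiJoin(a: Seq[Long], b: Seq[Long]): Seq[Long] =
    a.filter(x => b.contains(x))

  def main(args: Array[String]): Unit = {
    val a = Seq(1L, 2L, 3L)
    val uniqueB = Seq(2L, 3L, 4L)     // k is unique in B
    val duplicatedB = Seq(2L, 2L, 3L) // k = 2 appears twice in B

    println(innerJoin(a, uniqueB))        // List(2, 3)
    println(leftSemiJoin(a, uniqueB))     // List(2, 3)
    println(innerJoin(a, duplicatedB))    // List(2, 2, 3)
    println(leftSemiJoin(a, duplicatedB)) // List(2, 3)
  }
}
```

With unique keys in B both functions return the same rows, which matches the assumption in the question. Once a key repeats in B, the inner join repeats the matching A row while the semi join does not, so exists is the safer form if uniqueness is not guaranteed.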