Can we use Spark's CBO (cost-based optimizer) with native Parquet or in-memory DataFrames?

Time: 2019-03-18 01:03:17

Tags: apache-spark parquet cbo

Say I want to join three tables A, B, and C with inner joins, and C is very small.

// Dummy example with in-memory tables; the same issue occurs when loading the tables with spark.read.parquet("")
import spark.implicits._  // needed for toDF (already in scope in spark-shell)
val A = (1 to 1000000).toSeq.toDF("A")
val B = (1 to 1000000).toSeq.toDF("B")
val C = (1 to 10).toSeq.toDF("C")

I have no control over the order in which the joins are given to me:

import org.apache.spark.sql.functions.expr  // already in scope in spark-shell
val CASE1 = A.join(B, expr("A=B"), "inner").join(C, expr("A=C"), "inner")
val CASE2 = A.join(C, expr("A=C"), "inner").join(B, expr("A=B"), "inner")

Running both shows that CASE1 is 30-40% slower than CASE2.

So the question is: how can I get Spark's CBO to automatically rewrite CASE1 into CASE2, for in-memory tables or for tables loaded with Spark's Parquet reader?

I tried:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.cbo.enabled", "true")
A.createOrReplaceTempView("A")
spark.sql("ANALYZE TABLE A COMPUTE STATISTICS")

But this throws:

org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'a' not found in database 'default'

Is there another way to activate the CBO without saving the tables in Hive?


Appendix:

  1. Even with spark.conf.set("spark.sql.cbo.enabled", "true"), no cost estimates are shown in the Spark web UI.
  2. The plans differ: CASE1.explain != CASE2.explain

CASE1.explain

== Physical Plan ==
*(5) SortMergeJoin [A#3], [C#13], Inner
:- *(3) SortMergeJoin [A#3], [B#8], Inner
:  :- *(1) Sort [A#3 ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(A#3, 200)
:  :     +- LocalTableScan [A#3]
:  +- *(2) Sort [B#8 ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(B#8, 200)
:        +- LocalTableScan [B#8]
+- *(4) Sort [C#13 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(C#13, 200)
      +- LocalTableScan [C#13]

CASE2.explain

== Physical Plan ==
*(5) SortMergeJoin [A#3], [B#8], Inner
:- *(3) SortMergeJoin [A#3], [C#13], Inner
:  :- *(1) Sort [A#3 ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(A#3, 200)
:  :     +- LocalTableScan [A#3]
:  +- *(2) Sort [C#13 ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(C#13, 200)
:        +- LocalTableScan [C#13]
+- *(4) Sort [B#8 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(B#8, 200)
      +- LocalTableScan [B#8]

1 Answer:

Answer 0: (score: 0)

No. The short answer is that this is not possible.

The linked reference gives a good overview of what is possible; the gist is that the CBO needs a persisted data store.
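
For illustration, here is a minimal sketch of the persisted route the answer points to, assuming it is acceptable to write the DataFrames out as catalog tables first. The table names t_a, t_b, t_c and the local session are assumptions made for this example, not part of the question. Note that join reordering is controlled by a separate flag, spark.sql.cbo.joinReorder.enabled, which is off by default. Whether the optimizer actually reorders the joins depends on the Spark version and on the collected statistics, so the plan should be inspected rather than assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for the sketch; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cbo-sketch")
  .getOrCreate()
import spark.implicits._

// Enable the CBO and, separately, cost-based join reordering.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

// Persist the data as catalog tables (hypothetical names), since
// ANALYZE TABLE does not work on temporary views.
(1 to 1000000).toDF("A").write.mode("overwrite").saveAsTable("t_a")
(1 to 1000000).toDF("B").write.mode("overwrite").saveAsTable("t_b")
(1 to 10).toDF("C").write.mode("overwrite").saveAsTable("t_c")

// Collect table-level and column-level statistics for the optimizer.
Seq("t_a" -> "A", "t_b" -> "B", "t_c" -> "C").foreach { case (table, column) =>
  spark.sql(s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS $column")
}

// Same join order as CASE1, now reading from the analyzed tables.
val joined = spark.table("t_a")
  .join(spark.table("t_b"), expr("A = B"), "inner")
  .join(spark.table("t_c"), expr("A = C"), "inner")

// Print the parsed, analyzed, optimized, and physical plans for inspection.
joined.explain(true)
```

Comparing the optimized plan printed here with the plan for the CASE2 ordering shows whether the reordering took effect in a given setup.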