Spark SQL - Changing the query plan to a bushy plan

Asked: 2016-06-15 13:24:05

Tags: apache-spark query-optimization apache-spark-sql

I want to execute the following SQL query in Spark SQL:

sqlContext.sql("SELECT c.name, c.nationkey, n.name, l.orderkey, o.orderdate "
             + "FROM customers c, nations n, orders o, lineitems l "
             + "WHERE n.nationkey=20 AND c.nationkey=n.nationkey AND c.custkey=o.custkey AND o.orderkey=l.orderkey");

So three joins have to be performed.

Catalyst, the query analyzer and optimizer in Spark SQL, returns the following optimized logical and physical plans:

== Optimized Logical Plan ==
Project [name#5,nationkey#6,name#25,orderkey#14,orderdate#31]
+- Join Inner, Some((orderkey#32 = orderkey#14))
   :- Project [orderdate#31,nationkey#6,name#5,name#25,orderkey#32]
   :  +- Join Inner, Some((custkey#3 = custkey#30))
   :     :- Project [name#25,custkey#3,nationkey#6,name#5]
   :     :  +- Join Inner, Some((nationkey#6 = nationkey#26))
   :     :     :- Project [custkey#3,nationkey#6,name#5]
   :     :     :  +- LogicalRDD [acctbal#0,address#1,comment#2,custkey#3,mktsegment#4,name#5,nationkey#6,phone#7], MapPartitionsRDD[3] at createDataFrame at Query.java:66
   :     :     +- Project [nationkey#26,name#25]
   :     :        +- Filter (nationkey#26 = 20)
   :     :           +- LogicalRDD [comment#24,name#25,nationkey#26,regionkey#27], MapPartitionsRDD[11] at createDataFrame at Query.java:76
   :     +- Project [orderkey#32,orderdate#31,custkey#30]
   :        +- LogicalRDD [clerk#28,comment#29,custkey#30,orderdate#31,orderkey#32,orderpriority#33,orderstatus#34,shippriority#35,totalprice#36], MapPartitionsRDD[15] at createDataFrame at Query.java:81
   +- Project [orderkey#14]
      +- LogicalRDD [comment#8,commitdate#9,discount#10,extendedprice#11,linenumber#12,linestatus#13,orderkey#14,partkey#15,quantity#16,receiptdate#17,returnflag#18,shipdate#19,shipinstruct#20,shipmode#21,suppkey#22,tax#23], MapPartitionsRDD[7] at createDataFrame at Query.java:71


== Physical Plan ==
Project [name#5,nationkey#6,name#25,orderkey#14,orderdate#31]
+- SortMergeJoin [orderkey#32], [orderkey#14]
   :- Sort [orderkey#32 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(orderkey#32,200), None
   :     +- Project [orderdate#31,nationkey#6,name#5,name#25,orderkey#32]
   :        +- SortMergeJoin [custkey#3], [custkey#30]
   :           :- Sort [custkey#3 ASC], false, 0
   :           :  +- TungstenExchange hashpartitioning(custkey#3,200), None
   :           :     +- Project [name#25,custkey#3,nationkey#6,name#5]
   :           :        +- SortMergeJoin [nationkey#6], [nationkey#26]
   :           :           :- Sort [nationkey#6 ASC], false, 0
   :           :           :  +- TungstenExchange hashpartitioning(nationkey#6,200), None
   :           :           :     +- Project [custkey#3,nationkey#6,name#5]
   :           :           :        +- Scan ExistingRDD[acctbal#0,address#1,comment#2,custkey#3,mktsegment#4,name#5,nationkey#6,phone#7] 
   :           :           +- Sort [nationkey#26 ASC], false, 0
   :           :              +- TungstenExchange hashpartitioning(nationkey#26,200), None
   :           :                 +- Project [nationkey#26,name#25]
   :           :                    +- Filter (nationkey#26 = 20)
   :           :                       +- Scan ExistingRDD[comment#24,name#25,nationkey#26,regionkey#27] 
   :           +- Sort [custkey#30 ASC], false, 0
   :              +- TungstenExchange hashpartitioning(custkey#30,200), None
   :                 +- Project [orderkey#32,orderdate#31,custkey#30]
   :                    +- Scan ExistingRDD[clerk#28,comment#29,custkey#30,orderdate#31,orderkey#32,orderpriority#33,orderstatus#34,shippriority#35,totalprice#36] 
   +- Sort [orderkey#14 ASC], false, 0
      +- TungstenExchange hashpartitioning(orderkey#14,200), None
         +- Project [orderkey#14]
            +- Scan ExistingRDD[comment#8,commitdate#9,discount#10,extendedprice#11,linenumber#12,linestatus#13,orderkey#14,partkey#15,quantity#16,receiptdate#17,returnflag#18,shipdate#19,shipinstruct#20,shipmode#21,suppkey#22,tax#23]

As you can see, the query plan is a left-deep plan:

(Join(Join(Join(nationkey#6 = nationkey#26), custkey), orderkey))

In theory, a bushy plan could also be executed in this case:

                           Join (over custkey)
                                /  \
 Join(nationkey#6 = nationkey#26)   Join(orderkey#32 = orderkey#14)

This would allow the two lower joins to be executed in parallel.
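To make the difference between the two shapes concrete, here is a minimal plain-Java sketch that models both join trees and counts how many join levels lie on the longest path from the root to a leaf. It is only a toy model: the class `PlanShapes` and its helpers (`table`, `join`, `joinDepth`, `joinCount`) are hypothetical names made up for illustration, not part of Spark or Catalyst, and the depth is just a rough proxy for how many joins must wait on one another. It says nothing about table sizes or about how Spark actually schedules stages.

```java
/** Toy model of join trees; hypothetical helper, not a Spark or Catalyst API. */
class PlanShapes {

    /** A plan node: a base table (no children) or a binary join. */
    static final class Node {
        final String label;
        final Node left;
        final Node right;

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    static Node table(String name) {
        return new Node(name, null, null);
    }

    static Node join(String key, Node left, Node right) {
        return new Node("Join(" + key + ")", left, right);
    }

    /** Number of joins on the longest root-to-leaf path (sequential join levels). */
    static int joinDepth(Node n) {
        if (n.isLeaf()) {
            return 0;
        }
        return 1 + Math.max(joinDepth(n.left), joinDepth(n.right));
    }

    /** Total number of join nodes in the tree. */
    static int joinCount(Node n) {
        if (n.isLeaf()) {
            return 0;
        }
        return 1 + joinCount(n.left) + joinCount(n.right);
    }

    /** The left-deep shape from the plan above: ((customers x nations) x orders) x lineitems. */
    static Node leftDeep() {
        return join("orderkey",
                join("custkey",
                        join("nationkey", table("customers"), table("nations")),
                        table("orders")),
                table("lineitems"));
    }

    /** The bushy alternative: (customers x nations) joined with (orders x lineitems) over custkey. */
    static Node bushy() {
        return join("custkey",
                join("nationkey", table("customers"), table("nations")),
                join("orderkey", table("orders"), table("lineitems")));
    }

    public static void main(String[] args) {
        System.out.println("left-deep: joins=" + joinCount(leftDeep())
                + ", depth=" + joinDepth(leftDeep()));
        System.out.println("bushy:     joins=" + joinCount(bushy())
                + ", depth=" + joinDepth(bushy()));
    }
}
```

Both trees contain three joins, but in this model the left-deep tree has three join levels while the bushy tree has two, because the nationkey join and the orderkey join do not depend on each other.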

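The question is: is it possible (and if so, how) to manipulate Catalyst so that it generates a bushy plan and runs the leaf joins in parallel?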

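My motivation is to run independent (small or fast) joins in parallel, instead of processing several joins sequentially and thereby waiting for stragglers.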

0 answers:

No answers yet