I want to execute the following SQL query in Spark SQL:
sqlContext.sql("SELECT c.name, c.nationkey, n.name, l.orderkey, o.orderdate "
    + "FROM customers c, nations n, orders o, lineitems l "
    + "WHERE n.nationkey = 20 AND c.nationkey = n.nationkey "
    + "AND c.custkey = o.custkey AND o.orderkey = l.orderkey");
So three joins have to be executed.
Catalyst, the query analyzer and optimizer of Spark SQL, returns the following optimized logical plan and physical plan:
== Optimized Logical Plan ==
Project [name#5,nationkey#6,name#25,orderkey#14,orderdate#31]
+- Join Inner, Some((orderkey#32 = orderkey#14))
:- Project [orderdate#31,nationkey#6,name#5,name#25,orderkey#32]
: +- Join Inner, Some((custkey#3 = custkey#30))
: :- Project [name#25,custkey#3,nationkey#6,name#5]
: : +- Join Inner, Some((nationkey#6 = nationkey#26))
: : :- Project [custkey#3,nationkey#6,name#5]
: : : +- LogicalRDD [acctbal#0,address#1,comment#2,custkey#3,mktsegment#4,name#5,nationkey#6,phone#7], MapPartitionsRDD[3] at createDataFrame at Query.java:66
: : +- Project [nationkey#26,name#25]
: : +- Filter (nationkey#26 = 20)
: : +- LogicalRDD [comment#24,name#25,nationkey#26,regionkey#27], MapPartitionsRDD[11] at createDataFrame at Query.java:76
: +- Project [orderkey#32,orderdate#31,custkey#30]
: +- LogicalRDD [clerk#28,comment#29,custkey#30,orderdate#31,orderkey#32,orderpriority#33,orderstatus#34,shippriority#35,totalprice#36], MapPartitionsRDD[15] at createDataFrame at Query.java:81
+- Project [orderkey#14]
+- LogicalRDD [comment#8,commitdate#9,discount#10,extendedprice#11,linenumber#12,linestatus#13,orderkey#14,partkey#15,quantity#16,receiptdate#17,returnflag#18,shipdate#19,shipinstruct#20,shipmode#21,suppkey#22,tax#23], MapPartitionsRDD[7] at createDataFrame at Query.java:71
== Physical Plan ==
Project [name#5,nationkey#6,name#25,orderkey#14,orderdate#31]
+- SortMergeJoin [orderkey#32], [orderkey#14]
:- Sort [orderkey#32 ASC], false, 0
: +- TungstenExchange hashpartitioning(orderkey#32,200), None
: +- Project [orderdate#31,nationkey#6,name#5,name#25,orderkey#32]
: +- SortMergeJoin [custkey#3], [custkey#30]
: :- Sort [custkey#3 ASC], false, 0
: : +- TungstenExchange hashpartitioning(custkey#3,200), None
: : +- Project [name#25,custkey#3,nationkey#6,name#5]
: : +- SortMergeJoin [nationkey#6], [nationkey#26]
: : :- Sort [nationkey#6 ASC], false, 0
: : : +- TungstenExchange hashpartitioning(nationkey#6,200), None
: : : +- Project [custkey#3,nationkey#6,name#5]
: : : +- Scan ExistingRDD[acctbal#0,address#1,comment#2,custkey#3,mktsegment#4,name#5,nationkey#6,phone#7]
: : +- Sort [nationkey#26 ASC], false, 0
: : +- TungstenExchange hashpartitioning(nationkey#26,200), None
: : +- Project [nationkey#26,name#25]
: : +- Filter (nationkey#26 = 20)
: : +- Scan ExistingRDD[comment#24,name#25,nationkey#26,regionkey#27]
: +- Sort [custkey#30 ASC], false, 0
: +- TungstenExchange hashpartitioning(custkey#30,200), None
: +- Project [orderkey#32,orderdate#31,custkey#30]
: +- Scan ExistingRDD[clerk#28,comment#29,custkey#30,orderdate#31,orderkey#32,orderpriority#33,orderstatus#34,shippriority#35,totalprice#36]
+- Sort [orderkey#14 ASC], false, 0
+- TungstenExchange hashpartitioning(orderkey#14,200), None
+- Project [orderkey#14]
+- Scan ExistingRDD[comment#8,commitdate#9,discount#10,extendedprice#11,linenumber#12,linestatus#13,orderkey#14,partkey#15,quantity#16,receiptdate#17,returnflag#18,shipdate#19,shipinstruct#20,shipmode#21,suppkey#22,tax#23]
As you can see, the query plan is a left-deep plan:
(Join(Join(Join(nationkey#6 = nationkey#26), custkey), orderkey))
In theory, a bushy plan could also be executed in this case:
              Join (over custkey)
             /                    \
Join(nationkey#6 = nationkey#26)   Join(orderkey#32 = orderkey#14)
This would allow the two lower joins to be executed in parallel.
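To make the shape concrete, this is roughly how the bushy plan could be written by hand with the DataFrame API (just a sketch; customers, nations, orders and lineitems stand for the DataFrames created via createDataFrame in Query.java, and I have not verified that Catalyst keeps this shape after optimization):
import org.apache.spark.sql.DataFrame;

// Left branch: customers joined with the filtered nations table.
DataFrame nation20 = nations.filter(nations.col("nationkey").equalTo(20));
DataFrame custNation = customers.join(nation20,
    customers.col("nationkey").equalTo(nation20.col("nationkey")));

// Right branch: orders joined with lineitems.
DataFrame orderLine = orders.join(lineitems,
    orders.col("orderkey").equalTo(lineitems.col("orderkey")));

// Top join over custkey combines the two independent branches.
DataFrame bushy = custNation.join(orderLine,
    custNation.col("custkey").equalTo(orderLine.col("custkey")));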
The question is: is it possible (and if so, how) to make Catalyst generate such a bushy plan and run the two join branches in parallel?
My motivation is to run independent (small or fast) joins in parallel instead of processing all the joins sequentially, where everything ends up waiting on the slowest one.
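Concretely, the workaround I would like to avoid maintaining by hand looks roughly like this: materialize the two independent branches as separate, concurrently submitted jobs, and only then run the top join (again only a sketch built on the DataFrames assumed above; whether the two jobs really overlap depends on the scheduler and the available executors):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Submit the two independent sub-joins from separate threads so Spark can
// schedule their stages concurrently; count() forces the cached materialization.
ExecutorService pool = Executors.newFixedThreadPool(2);
Future<Long> leftDone  = pool.submit(() -> custNation.cache().count());
Future<Long> rightDone = pool.submit(() -> orderLine.cache().count());
leftDone.get();   // exception handling omitted
rightDone.get();
pool.shutdown();

// Final join over the two cached branches.
DataFrame result = custNation.join(orderLine,
    custNation.col("custkey").equalTo(orderLine.col("custkey")));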