As we all know, when writing SQL we have a defined lexical order of operations:
SELECT ...
FROM ...
JOIN ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ...
How does this manifest in Spark? I do know it's all about attributes of particular objects, so let me put the question another way: for someone coming from SQL, what is a useful way to think about the lexical order of operations when writing Spark applications?
To illustrate my confusion: here are two pieces of code from my tests where I put orderBy in two completely different places (again, coming from a SQL background), yet the code produces exactly the same results:
tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .orderBy(desc("End Date DT")) \
    .groupBy("End Date DT") \
    .count() \
    .show()
tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .groupBy("End Date DT") \
    .count() \
    .orderBy(desc("End Date DT")) \
    .show()
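To check that the results really do match (row order aside), one can take a set difference between the two; this is just a sanity-check sketch assuming the same tripDatawithDT / tripData DataFrames as above and Spark 2.4+ (for exceptAll):

from pyspark.sql.functions import desc

# Both variants, without show(), so the resulting DataFrames can be compared.
by_order_first = tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .orderBy(desc("End Date DT")) \
    .groupBy("End Date DT") \
    .count()

by_group_first = tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .groupBy("End Date DT") \
    .count() \
    .orderBy(desc("End Date DT"))

# exceptAll is empty in both directions iff the two results contain
# exactly the same rows (ignoring order).
assert by_order_first.exceptAll(by_group_first).count() == 0
assert by_group_first.exceptAll(by_order_first).count() == 0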
Still, in other cases I have completely messed up my code because of a wrong lexical order of operations.
Answer (score: 4)
TL;DR: As long as you use a standard open-source build without custom optimizer Rules, you can assume that each DSL operation induces a logical subquery, and that all logical optimizations are consistent with the SQL:2003 standard. In other words, your SQL intuition should apply here.
Internally, Spark represents a SQL query as a tree of LogicalPlans, where each operator corresponds to a single node, with its inputs as children.
As a result, the unoptimized logical plan corresponding to a DSL expression consists of one nested node per operator (projection, selection, sort, aggregation with or without grouping). So, given the table
from pyspark.sql.functions import col, desc
t0 = spark.createDataFrame(
    [], "`End Date DT` timestamp, `Subscriber Type` string"
)
t0.createOrReplaceTempView("t0")
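(All plans quoted below can be reproduced with DataFrame.explain(extended=True), which prints the parsed, analyzed, and optimized logical plans followed by the physical plan; a minimal sketch using the t0 defined above:)

# Prints the parsed, analyzed and optimized logical plans,
# followed by the physical plan, for any DataFrame.
(t0.filter(col("Subscriber Type") == "Subscriber")
    .groupBy("End Date DT")
    .count()
    .explain(extended=True))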
the first query
(t0.alias("t0")
    .filter(col("Subscriber Type") == "Subscriber").alias("t1")
    .orderBy(desc("End Date DT")).alias("t2")
    .groupBy("End Date DT")
    .count())
is roughly equivalent to*
SELECT `End Date DT`, COUNT(*) AS count FROM (
    SELECT * FROM (
        SELECT * FROM t0 WHERE `Subscriber Type` = 'Subscriber'
    ) as t1 ORDER BY `End Date DT` DESC
) as t2 GROUP BY `End Date DT`
while
(t0.alias("t0")
    .filter(col("Subscriber Type") == "Subscriber").alias("t1")
    .groupBy("End Date DT")
    .count().alias("t2")
    .orderBy(desc("End Date DT")))
is roughly equivalent to**
SELECT * FROM (
    SELECT `End Date DT`, COUNT(*) AS count FROM (
        SELECT * FROM t0 WHERE `Subscriber Type` = 'Subscriber'
    ) as t1 GROUP BY `End Date DT`
) as t2 ORDER BY `End Date DT` DESC
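If you want to double-check these rough equivalences yourself, either SQL string can be executed against the registered t0 view and its plans compared with the DSL chain's explain output (they will match up to the extra Projects noted in the footnotes); for example, for the second variant:

# Run the SQL counterpart against the t0 view and inspect its plans.
spark.sql("""
    SELECT * FROM (
        SELECT `End Date DT`, COUNT(*) AS count FROM (
            SELECT * FROM t0 WHERE `Subscriber Type` = 'Subscriber'
        ) as t1 GROUP BY `End Date DT`
    ) as t2 ORDER BY `End Date DT` DESC
""").explain(extended=True)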
Clearly, the two queries are not equivalent, and this is reflected in their optimized execution plans.
ORDER BY before GROUP BY corresponds to
== Optimized Logical Plan ==
Aggregate [End Date DT#38], [End Date DT#38, count(1) AS count#70L]
+- Sort [End Date DT#38 DESC NULLS LAST], true
+- Project [End Date DT#38]
+- Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
and ORDER BY after GROUP BY corresponds to
== Optimized Logical Plan ==
Sort [End Date DT#38 DESC NULLS LAST], true
+- Aggregate [End Date DT#38], [End Date DT#38, count(1) AS count#84L]
+- Project [End Date DT#38]
+- Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
So why can these give the same final result? That's because in basic cases like this one, the query planner treats the preceding ORDER BY as a hint to apply range partitioning instead of hash partitioning. So the physical plan for ORDER BY followed by GROUP BY will be
== Physical Plan ==
*(2) HashAggregate(keys=[End Date DT#38], functions=[count(1)])
+- *(2) HashAggregate(keys=[End Date DT#38], functions=[partial_count(1)])
+- *(2) Sort [End Date DT#38 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(End Date DT#38 DESC NULLS LAST, 200)
+- *(1) Project [End Date DT#38]
+- *(1) Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- Scan ExistingRDD[End Date DT#38,Subscriber Type#39]
Without the ORDER BY***, it would default to hash partitioning
== Physical Plan ==
*(2) HashAggregate(keys=[End Date DT#38], functions=[count(1)])
+- Exchange hashpartitioning(End Date DT#38, 200)
+- *(1) HashAggregate(keys=[End Date DT#38], functions=[partial_count(1)])
+- *(1) Project [End Date DT#38]
+- *(1) Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- Scan ExistingRDD[End Date DT#38,Subscriber Type#39]
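(As an aside, the 200 in the Exchange rangepartitioning / hashpartitioning nodes above is just the default value of spark.sql.shuffle.partitions, the number of partitions used for shuffles triggered by aggregations and sorts; it is configurable:)

# Where the "200" in the Exchange nodes comes from; adjust as needed.
spark.conf.set("spark.sql.shuffle.partitions", "400")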
Because this happens during the planning phase, which is an extension point subject to heavy changes (especially for data source providers), I would consider this an implementation detail and would not depend on this behavior for correctness.
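In practice that means: if you care about the order of the final output, state the sort explicitly as the last step rather than relying on an upstream sort surviving the aggregation. A minimal sketch of that pattern, reusing t0 and the imports from above:

# Sort *after* the aggregation when output order matters; a sort placed
# before groupBy is not guaranteed to survive it.
(t0.filter(col("Subscriber Type") == "Subscriber")
    .groupBy("End Date DT")
    .count()
    .orderBy(desc("End Date DT"))
    .show())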
* With the parsed logical plan for the DSL variant being
== Parsed Logical Plan ==
'Aggregate ['End Date DT], [unresolvedalias('End Date DT, None), count(1) AS count#45L]
+- SubqueryAlias `t2`
+- Sort [End Date DT#38 DESC NULLS LAST], true
+- SubqueryAlias `t1`
+- Filter (Subscriber Type#39 = Subscriber)
+- SubqueryAlias `t0`
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
and for the SQL variant
== Parsed Logical Plan ==
'Aggregate ['End Date DT], ['End Date DT, 'COUNT(1) AS count#50]
+- 'SubqueryAlias `t2`
+- 'Sort ['End Date DT DESC NULLS LAST], true
+- 'Project [*]
+- 'SubqueryAlias `t1`
+- 'Project [*]
+- 'Filter ('Subscriber Type = Subscriber)
+- 'UnresolvedRelation `t0`
** With the parsed logical plan for the DSL variant being
== Parsed Logical Plan ==
'Sort ['End Date DT DESC NULLS LAST], true
+- SubqueryAlias `t2`
+- Aggregate [End Date DT#38], [End Date DT#38, count(1) AS count#59L]
+- SubqueryAlias `t1`
+- Filter (Subscriber Type#39 = Subscriber)
+- SubqueryAlias `t0`
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
and for the SQL variant
== Parsed Logical Plan ==
'Sort ['End Date DT DESC NULLS LAST], true
+- 'Project [*]
+- 'SubqueryAlias `t2`
+- 'Aggregate ['End Date DT], ['End Date DT, 'COUNT(1) AS count#64]
+- 'SubqueryAlias `t1`
+- 'Project [*]
+- 'Filter ('Subscriber Type = Subscriber)
+- 'UnresolvedRelation `t0`
*** i.e.
(t0.alias("t0")
    .filter(col("Subscriber Type") == "Subscriber").alias("t1")
    .groupBy("End Date DT")
    .count()).explain()