As we all know, when writing SQL we have a defined lexical order of operations:
SELECT ...
FROM ...
JOIN ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ...
How does this manifest in Spark? I do know it's all about attributes of particular objects, so let me put the question another way: for someone coming from SQL, what is a useful way to think about the lexical order of operations when writing Spark applications?
To illustrate my confusion: here are two pieces of code from my tests where I put orderBy in two completely different places (again, coming from a SQL background), yet the code produces exactly the same results:
tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .orderBy(desc("End Date DT")) \
    .groupBy("End Date DT") \
    .count() \
    .show()
tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .groupBy("End Date DT") \
    .count() \
    .orderBy(desc("End Date DT")) \
    .show()
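To check that the results really do match (row order aside), one can take a set difference between the two; this is just a sanity-check sketch assuming the same tripDatawithDT / tripData DataFrames as above and Spark 2.4+ (for exceptAll):

from pyspark.sql.functions import desc

# Both variants, without show(), so the resulting DataFrames can be compared.
by_order_first = tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .orderBy(desc("End Date DT")) \
    .groupBy("End Date DT") \
    .count()

by_group_first = tripDatawithDT \
    .filter(tripData["Subscriber Type"] == "Subscriber") \
    .groupBy("End Date DT") \
    .count() \
    .orderBy(desc("End Date DT"))

# exceptAll is empty in both directions iff the two results contain
# exactly the same rows (ignoring order).
assert by_order_first.exceptAll(by_group_first).count() == 0
assert by_group_first.exceptAll(by_order_first).count() == 0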
Still, in other cases I have completely messed up my code because of a wrong lexical order of operations.
Answer (score: 4)
TL;DR: As long as you use a standard open-source build without custom optimizer Rules, you can assume that each DSL operation induces a logical subquery, and that all logical optimizations are consistent with the SQL:2003 standard. In other words, your SQL intuition should apply here.
Internally, Spark represents a SQL query as a tree of LogicalPlans, where each operator corresponds to a single node, with its inputs as children.
As a result, the unoptimized logical plan corresponding to a DSL expression consists of one nested node per operator (projection, selection, sort, aggregation with or without grouping). So, given the table
from pyspark.sql.functions import col, desc
t0 = spark.createDataFrame(
    [], "`End Date DT` timestamp, `Subscriber Type` string"
)
t0.createOrReplaceTempView("t0")
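(All plans quoted below can be reproduced with DataFrame.explain(extended=True), which prints the parsed, analyzed, and optimized logical plans followed by the physical plan; a minimal sketch using the t0 defined above:)

# Prints the parsed, analyzed and optimized logical plans,
# followed by the physical plan, for any DataFrame.
(t0.filter(col("Subscriber Type") == "Subscriber")
    .groupBy("End Date DT")
    .count()
    .explain(extended=True))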
the first query
(t0.alias("t0")
    .filter(col("Subscriber Type") == "Subscriber").alias("t1")
    .orderBy(desc("End Date DT")).alias("t2")
    .groupBy("End Date DT")
    .count())
is roughly equivalent to*
SELECT `End Date DT`, COUNT(*) AS count FROM (
    SELECT * FROM (
        SELECT * FROM t0 WHERE `Subscriber Type` = 'Subscriber'
    ) as t1 ORDER BY `End Date DT` DESC
) as t2 GROUP BY `End Date DT`
while
(t0.alias("t0")
    .filter(col("Subscriber Type") == "Subscriber").alias("t1")
    .groupBy("End Date DT")
    .count().alias("t2")
    .orderBy(desc("End Date DT")))
is roughly equivalent to**
SELECT * FROM (
    SELECT `End Date DT`, COUNT(*) AS count FROM (
        SELECT * FROM t0 WHERE `Subscriber Type` = 'Subscriber'
    ) as t1 GROUP BY `End Date DT`
) as t2 ORDER BY `End Date DT` DESC
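If you want to double-check these rough equivalences yourself, either SQL string can be executed against the registered t0 view and its plans compared with the DSL chain's explain output (they will match up to the extra Projects noted in the footnotes); for example, for the second variant:

# Run the SQL counterpart against the t0 view and inspect its plans.
spark.sql("""
    SELECT * FROM (
        SELECT `End Date DT`, COUNT(*) AS count FROM (
            SELECT * FROM t0 WHERE `Subscriber Type` = 'Subscriber'
        ) as t1 GROUP BY `End Date DT`
    ) as t2 ORDER BY `End Date DT` DESC
""").explain(extended=True)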
Clearly, the two queries are not equivalent, and this is reflected in their optimized execution plans.
ORDER BY before GROUP BY corresponds to
== Optimized Logical Plan ==
Aggregate [End Date DT#38], [End Date DT#38, count(1) AS count#70L]
+- Sort [End Date DT#38 DESC NULLS LAST], true
+- Project [End Date DT#38]
+- Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
and ORDER BY after GROUP BY corresponds to
== Optimized Logical Plan ==
Sort [End Date DT#38 DESC NULLS LAST], true
+- Aggregate [End Date DT#38], [End Date DT#38, count(1) AS count#84L]
+- Project [End Date DT#38]
+- Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
So why can these give the same final result? That's because in basic cases like this one, the query planner treats the preceding ORDER BY as a hint to apply range partitioning instead of hash partitioning. So the physical plan for ORDER BY followed by GROUP BY will be
== Physical Plan ==
*(2) HashAggregate(keys=[End Date DT#38], functions=[count(1)])
+- *(2) HashAggregate(keys=[End Date DT#38], functions=[partial_count(1)])
+- *(2) Sort [End Date DT#38 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(End Date DT#38 DESC NULLS LAST, 200)
+- *(1) Project [End Date DT#38]
+- *(1) Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- Scan ExistingRDD[End Date DT#38,Subscriber Type#39]
Without the ORDER BY***, it would default to hash partitioning
== Physical Plan ==
*(2) HashAggregate(keys=[End Date DT#38], functions=[count(1)])
+- Exchange hashpartitioning(End Date DT#38, 200)
+- *(1) HashAggregate(keys=[End Date DT#38], functions=[partial_count(1)])
+- *(1) Project [End Date DT#38]
+- *(1) Filter (isnotnull(Subscriber Type#39) && (Subscriber Type#39 = Subscriber))
+- Scan ExistingRDD[End Date DT#38,Subscriber Type#39]
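(As an aside, the 200 in the Exchange rangepartitioning / hashpartitioning nodes above is just the default value of spark.sql.shuffle.partitions, the number of partitions used for shuffles triggered by aggregations and sorts; it is configurable:)

# Where the "200" in the Exchange nodes comes from; adjust as needed.
spark.conf.set("spark.sql.shuffle.partitions", "400")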
Because this happens during the planning phase, which is an extension point subject to heavy changes (especially for data source providers), I would consider this an implementation detail and would not depend on this behavior for correctness.
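In practice that means: if you care about the order of the final output, state the sort explicitly as the last step rather than relying on an upstream sort surviving the aggregation. A minimal sketch of that pattern, reusing t0 and the imports from above:

# Sort *after* the aggregation when output order matters; a sort placed
# before groupBy is not guaranteed to survive it.
(t0.filter(col("Subscriber Type") == "Subscriber")
    .groupBy("End Date DT")
    .count()
    .orderBy(desc("End Date DT"))
    .show())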
* With the parsed logical plan for the DSL variant being
== Parsed Logical Plan ==
'Aggregate ['End Date DT], [unresolvedalias('End Date DT, None), count(1) AS count#45L]
+- SubqueryAlias `t2`
+- Sort [End Date DT#38 DESC NULLS LAST], true
+- SubqueryAlias `t1`
+- Filter (Subscriber Type#39 = Subscriber)
+- SubqueryAlias `t0`
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
and for the SQL variant
== Parsed Logical Plan ==
'Aggregate ['End Date DT], ['End Date DT, 'COUNT(1) AS count#50]
+- 'SubqueryAlias `t2`
+- 'Sort ['End Date DT DESC NULLS LAST], true
+- 'Project [*]
+- 'SubqueryAlias `t1`
+- 'Project [*]
+- 'Filter ('Subscriber Type = Subscriber)
+- 'UnresolvedRelation `t0`
** With the parsed logical plan for the DSL variant being
== Parsed Logical Plan ==
'Sort ['End Date DT DESC NULLS LAST], true
+- SubqueryAlias `t2`
+- Aggregate [End Date DT#38], [End Date DT#38, count(1) AS count#59L]
+- SubqueryAlias `t1`
+- Filter (Subscriber Type#39 = Subscriber)
+- SubqueryAlias `t0`
+- LogicalRDD [End Date DT#38, Subscriber Type#39], false
and for the SQL variant
== Parsed Logical Plan ==
'Sort ['End Date DT DESC NULLS LAST], true
+- 'Project [*]
+- 'SubqueryAlias `t2`
+- 'Aggregate ['End Date DT], ['End Date DT, 'COUNT(1) AS count#64]
+- 'SubqueryAlias `t1`
+- 'Project [*]
+- 'Filter ('Subscriber Type = Subscriber)
+- 'UnresolvedRelation `t0`
*** i.e.
(t0.alias("t0")
    .filter(col("Subscriber Type") == "Subscriber").alias("t1")
    .groupBy("End Date DT")
    .count()).explain()