Question

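I would like to know whether the position of a specific filter condition in a query causes any noticeable performance difference.

I have a sample table, date_dim. It always contains exactly one record, for the current execution date:

dt | frst_day_mth | last_day_mth
16/05/2019 | 01/05/2019 | 31/05/2019  -- Table always has only 1 row for that day

Now I have a query such as:

select a.id, b.name, c.salary
from tableA a
inner join tableB b on a.id = b.id
inner join tableC c on b.name = c.name

I now have to apply a date filter such as tableA.eff_dt <= date_dim.last_mth_day. My question is: which of the options below is best in terms of performance? Is it better to put the filter in the join's ON clause together with a subquery (Option 1), so that records are reduced as early as possible, or to apply it later in the where clause (Option 2)? Tables A, B and C each have about 20 million rows. I am using Spark SQL.

Option 1: apply the filter in the join's ON clause, using a subquery on date_dim.

Option 2: apply the filter in the where clause.

Please let me know your comments.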

Answer 1

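Your query amounts to filtering the rows of tableA based on the single value held in date_dim.

So I believe that, wherever you place the filter, the Spark query optimizer will read from tableA only the rows that match the filter condition (this happens because of the predicate pushdown mechanism). As a result, only those rows take part in the join.

You can refer to this link for more information: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Optimizer-PushDownPredicate.html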
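
One way to check this on your own Spark version is to compare the plans that EXPLAIN prints for each placement. The two queries below are only a sketch: they are my assumption of what the two options might look like, the alias d is made up for illustration, and the column last_day_mth is taken from the date_dim sample row.

```sql
-- Assumed shape of Option 1: filter in the ON clause, against a subquery on date_dim
EXPLAIN EXTENDED
select a.id, b.name, c.salary
from tableA a
inner join (select last_day_mth from date_dim) d
  on a.eff_dt <= d.last_day_mth
inner join tableB b on a.id = b.id
inner join tableC c on b.name = c.name;

-- Assumed shape of Option 2: same joins, filter applied in the where clause
EXPLAIN EXTENDED
select a.id, b.name, c.salary
from tableA a
cross join date_dim d
inner join tableB b on a.id = b.id
inner join tableC c on b.name = c.name
where a.eff_dt <= d.last_day_mth;
```

If the optimized logical plan and the physical plan come out the same for both statements, the placement of the filter makes no difference for these inner joins. The thing to look for is where the condition on eff_dt appears relative to the joins on tableB and tableC.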
