How to compute the cost of an optimized query plan in Spark

Asked: 2018-06-07 16:07:55

Tags: apache-spark apache-spark-sql apache-spark-2.0 cost-based-optimizer

I went through the Cost Based Optimizer (CBO) blog post published on the Databricks website; the CBO was introduced in Spark 2.2.

It mentions that the cost of a query plan is computed with the formula:

cost = weight * cardinality + (1.0 - weight) * size

My assumption is that cardinality is based on the joins and size is the total number of rows returned.
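To make the formula concrete, here is a minimal sketch of how I read it, assuming weight comes from spark.sql.cbo.joinReorder.card.weight (default 0.7) and that cardinality and size are the rowCount and sizeInBytes of a node's Statistics. The helper planCost is hypothetical, not Spark's internal cost class:

    import org.apache.spark.sql.SparkSession

    // Hypothetical helper: evaluate the blog-post formula on the root of the
    // optimized logical plan. rowCount/sizeInBytes come from Statistics and the
    // weight from the join-reorder config; this is my interpretation, not Spark's code.
    def planCost(spark: SparkSession, query: String): BigDecimal = {
      val weight = spark.conf.get("spark.sql.cbo.joinReorder.card.weight", "0.7").toDouble
      val stats  = spark.sql(query).queryExecution.optimizedPlan.stats
      val cardinality = BigDecimal(stats.rowCount.getOrElse(BigInt(0)))
      val size        = BigDecimal(stats.sizeInBytes)
      cardinality * BigDecimal(weight) + size * BigDecimal(1.0 - weight)
    }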

For example, if I run the following query in Spark:

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.catalyst.plans.logical.Statistics

    // Run the query and inspect the statistics of the optimized logical plan
    val queryStmt = "select * from maha a, maha b where a.county=b.county and a.county='KINGS'"
    val exec: QueryExecution = session.sql(queryStmt).queryExecution
    val stats: Statistics = exec.optimizedPlan.stats
    println(exec.stringWithStats)

Output:

== Optimized Logical Plan ==
Join Inner, (county#7469 = county#7474), Statistics(sizeInBytes=420.2 MB, rowCount=3.50E+6, hints=none)
:- Filter (isnotnull(county#7469) && (county#7469 = KINGS)), Statistics(sizeInBytes=122.4 KB, rowCount=1.87E+3, hints=none)
:  +- Relation[Year#7467,FirstName#7468,County#7469,Sex#7470,Count#7471] parquet, Statistics(sizeInBytes=15.0 MB, rowCount=2.36E+5, hints=none)
+- Filter ((county#7474 = KINGS) && isnotnull(county#7474)), Statistics(sizeInBytes=122.4 KB, rowCount=1.87E+3, hints=none)
   +- Relation[Year#7472,FirstName#7473,County#7474,Sex#7475,Count#7476] parquet, Statistics(sizeInBytes=15.0 MB, rowCount=2.36E+5, hints=none)

== Physical Plan ==
*(2) BroadcastHashJoin [county#7469], [county#7474], Inner, BuildRight
:- *(2) Project [Year#7467, FirstName#7468, County#7469, Sex#7470, Count#7471]
:  +- *(2) Filter (isnotnull(county#7469) && (county#7469 = KINGS))
:     +- *(2) FileScan parquet default.maha[Year#7467,FirstName#7468,County#7469,Sex#7470,Count#7471] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/dev/query-analyzer/spark-warehouse/maha], PartitionFilters: [], PushedFilters: [IsNotNull(County), EqualTo(County,KINGS)], ReadSchema: struct<Year:int,FirstName:string,County:string,Sex:string,Count:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[2, string, true]))
   +- *(1) Project [Year#7472, FirstName#7473, County#7474, Sex#7475, Count#7476]
      +- *(1) Filter ((county#7474 = KINGS) && isnotnull(county#7474))
         +- *(1) FileScan parquet default.maha[Year#7472,FirstName#7473,County#7474,Sex#7475,Count#7476] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/dev/query-analyzer/spark-warehouse/maha], PartitionFilters: [], PushedFilters: [EqualTo(County,KINGS), IsNotNull(County)], ReadSchema: struct<Year:int,FirstName:string,County:string,Sex:string,Count:int>

How should we compute the cost of the query plan? Should we sum up the statistics of the intermediate steps, or just take the final statistics from the last step and compute it from those?
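For clarity, the two alternatives I have in mind look roughly like this (a sketch only; rootStats and allNodeStats are hypothetical names, not Spark APIs):

    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Statistics}

    // Option A (hypothetical): use only the statistics of the final (root) node
    def rootStats(plan: LogicalPlan): Statistics = plan.stats

    // Option B (hypothetical): collect the statistics of every node in the
    // optimized plan, so their individual costs could be summed up
    def allNodeStats(plan: LogicalPlan): Seq[Statistics] =
      plan.collect { case node => node.stats }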

Spark version: 2.3.0

0 Answers

There are no answers yet.