I went through the blog post published on the Databricks website about the cost-based optimizer (CBO), which was introduced in Spark 2.2.
It mentions that the cost of a query plan is calculated with the formula:
cost = weight * cardinality + (1.0 - weight) * size
My assumption is that cardinality is based on the join, and that size is the total number of rows returned.
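To make sure I am reading the formula correctly, here is how I would express it as a small Scala helper. The default weight of 0.7 is my own assumption, based on what I believe the default of spark.sql.cbo.joinReorder.card.weight is:

// Sketch of the cost formula from the blog post; the 0.7 default is an
// assumption (presumed default of spark.sql.cbo.joinReorder.card.weight).
def planCost(rowCount: BigInt, sizeInBytes: BigInt, weight: Double = 0.7): Double =
  weight * rowCount.toDouble + (1.0 - weight) * sizeInBytes.toDouble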
For example, if I run the following query in Spark:
import org.apache.spark.sql.catalyst.plans.logical.Statistics
import org.apache.spark.sql.execution.QueryExecution

// session is an existing SparkSession; maha is a Parquet table in the local warehouse
val queryStmt = "select * from maha a, maha b where a.county=b.county and a.county='KINGS'"
val exec: QueryExecution = session.sql(queryStmt).queryExecution
val stats: Statistics = exec.optimizedPlan.stats
println(exec.stringWithStats)
Output:
== Optimized Logical Plan ==
Join Inner, (county#7469 = county#7474), Statistics(sizeInBytes=420.2 MB, rowCount=3.50E+6, hints=none)
:- Filter (isnotnull(county#7469) && (county#7469 = KINGS)), Statistics(sizeInBytes=122.4 KB, rowCount=1.87E+3, hints=none)
:  +- Relation[Year#7467,FirstName#7468,County#7469,Sex#7470,Count#7471] parquet, Statistics(sizeInBytes=15.0 MB, rowCount=2.36E+5, hints=none)
+- Filter ((county#7474 = KINGS) && isnotnull(county#7474)), Statistics(sizeInBytes=122.4 KB, rowCount=1.87E+3, hints=none)
   +- Relation[Year#7472,FirstName#7473,County#7474,Sex#7475,Count#7476] parquet, Statistics(sizeInBytes=15.0 MB, rowCount=2.36E+5, hints=none)
== Physical Plan ==
*(2) BroadcastHashJoin [county#7469], [county#7474], Inner, BuildRight
:- *(2) Project [Year#7467, FirstName#7468, County#7469, Sex#7470, Count#7471]
:  +- *(2) Filter (isnotnull(county#7469) && (county#7469 = KINGS))
:     +- *(2) FileScan parquet default.maha[Year#7467,FirstName#7468,County#7469,Sex#7470,Count#7471] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/dev/query-analyzer/spark-warehouse/maha], PartitionFilters: [], PushedFilters: [IsNotNull(County), EqualTo(County,KINGS)], ReadSchema: struct<Year:int,FirstName:string,County:string,Sex:string,Count:int>
+- BroadcastExchange HashedRelationBroadcastMode(List(input[2, string, true]))
   +- *(1) Project [Year#7472, FirstName#7473, County#7474, Sex#7475, Count#7476]
      +- *(1) Filter ((county#7474 = KINGS) && isnotnull(county#7474))
         +- *(1) FileScan parquet default.maha[Year#7472,FirstName#7473,County#7474,Sex#7475,Count#7476] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/C:/dev/query-analyzer/spark-warehouse/maha], PartitionFilters: [], PushedFilters: [EqualTo(County,KINGS), IsNotNull(County)], ReadSchema: struct<Year:int,FirstName:string,County:string,Sex:string,Count:int>
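To make the question concrete, here is a sketch that evaluates the formula on every node of the optimized plan, then both sums the results over all nodes and picks out the root node alone, so the two interpretations can be compared (again assuming weight = 0.7):

val weight = 0.7  // assumed default of spark.sql.cbo.joinReorder.card.weight
// collect traverses the optimized plan pre-order, so the first entry is the root (the Join)
val perNodeCost = exec.optimizedPlan.collect {
  case node =>
    val s = node.stats
    val cost = weight * s.rowCount.getOrElse(BigInt(0)).toDouble +
               (1.0 - weight) * s.sizeInBytes.toDouble
    node.nodeName -> cost
}
perNodeCost.foreach { case (name, cost) => println(f"$name%-10s cost = $cost%.2f") }
println(s"sum over all nodes         = ${perNodeCost.map(_._2).sum}")
println(s"cost of the root node only = ${perNodeCost.head._2}")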
How should the cost of the query plan be calculated? Should we sum up the statistics of the intermediate steps, or take only the final statistics from the last step and compute it from those?
Spark version: 2.3.0