Spark CBO does not show the row count for queries that filter on a partition column

Asked: 2018-08-21 06:11:08

Tags: apache-spark apache-spark-sql cost-based-optimizer

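I'm working with Spark 2.3.0 and using the cost-based optimizer (CBO) to compute statistics for queries run against an external table.

I created an external table in Spark: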

CREATE EXTERNAL TABLE IF NOT EXISTS test (
eventID string,type string,exchange string,eventTimestamp bigint,sequenceNumber bigint
,optionID string,orderID string,side string,routingFirm string,routedOrderID string
,session string,price decimal(18,8),quantity bigint,timeInForce string,handlingInstructions string
,orderAttributes string,isGloballyUnique boolean,originalOrderID string,initiator string,leavesQty bigint
,symbol string,routedOriginalOrderID string,displayQty bigint,orderType string,coverage string
,result string,resultTimestamp bigint,nbbPrice decimal(18,8),nbbQty bigint,nboPrice decimal(18,8)
,nboQty bigint,reporter string,quoteID string,noteType string,definedNoteData string,undefinedNoteData string
,note string,desiredLeavesQty bigint,displayPrice decimal(18,8),workingPrice decimal(18,8),complexOrderID string
,complexOptionID string,cancelQty bigint,cancelReason string,openCloseIndicator string,exchOriginCode string
,executingFirm string,executingBroker string,cmtaFirm string,mktMkrSubAccount string,originalOrderDate string
,tradeID string,saleCondition string,executionCodes string,buyDetails_side string,buyDetails_leavesQty bigint
,buyDetails_openCloseIndicator string,buyDetails_quoteID string,buyDetails_orderID string,buyDetails_executingFirm string,buyDetails_executingBroker string,buyDetails_cmtaFirm string,buyDetails_mktMkrSubAccount string,buyDetails_exchOriginCode string,buyDetails_liquidityCode string,buyDetails_executionCodes string,sellDetails_side string,sellDetails_leavesQty bigint,sellDetails_openCloseIndicator string,sellDetails_quoteID string,sellDetails_orderID string,sellDetails_executingFirm string,sellDetails_executingBroker string,sellDetails_cmtaFirm string,sellDetails_mktMkrSubAccount string,sellDetails_exchOriginCode string,sellDetails_liquidityCode string,sellDetails_executionCodes string,tradeDate int,reason string,executionTimestamp bigint,capacity string,fillID string,clearingNumber string
,contraClearingNumber string,buyDetails_capacity string,buyDetails_clearingNumber string,sellDetails_capacity string
,sellDetails_clearingNumber string,receivingFirm string,marketMaker string,sentTimestamp bigint,onlyOneQuote boolean
,originalQuoteID string,bidPrice decimal(18,8),bidQty bigint,askPrice decimal(18,8),askQty bigint,declaredTimestamp bigint,revokedTimestamp bigint,awayExchange string,comments string,clearingFirm string )
PARTITIONED BY (date integer ,reporteIDs string ,version integer )
STORED AS PARQUET LOCATION '/home/test/' 

I computed the column statistics with the following commands:

val df = spark.read.parquet("/home/test/")
val cols = df.columns.mkString(",")
val analyzeDDL = s"ANALYZE TABLE test COMPUTE STATISTICS FOR COLUMNS $cols"
spark.sql(analyzeDDL)
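
For comparison, here is a minimal sketch of running a table-level ANALYZE next to the column-level one, with the CBO flag set explicitly. The helper `analyzeStatements` is a hypothetical name used only for illustration. `spark.sql.cbo.enabled` is a standard Spark setting, but whether it was already enabled in the session above is not stated, so treat that part as an assumption.

```scala
object AnalyzeSketch {
  // Build a table-level and a column-level ANALYZE statement for one table.
  // Hypothetical helper; it only assembles SQL strings.
  def analyzeStatements(table: String, columns: Seq[String]): Seq[String] = {
    val tableLevel = s"ANALYZE TABLE $table COMPUTE STATISTICS"
    val columnLevel =
      s"ANALYZE TABLE $table COMPUTE STATISTICS FOR COLUMNS ${columns.mkString(", ")}"
    Seq(tableLevel, columnLevel)
  }
}

// In spark-shell (assumes a session `spark` and the table `test` from above):
//   spark.conf.set("spark.sql.cbo.enabled", "true")
//   AnalyzeSketch.analyzeStatements("test", spark.table("test").columns)
//     .foreach(stmt => spark.sql(stmt))
//   // Shows what the catalog recorded for the table, if anything.
//   spark.sql("DESCRIBE EXTENDED test").show(200, truncate = false)
```

Checking the `DESCRIBE EXTENDED` output first helps separate two cases: the statistics were never stored in the catalog, or they were stored but not picked up by the plan.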

Now, when I try to get the statistics for a query:

val query = "Select * from test where date > 20180222"

it gives me only the size, not the rowCount:

scala> val exec = spark.sql(query).queryExecution
exec: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('date > 20180222)
   +- 'UnresolvedRelation `test`

== Analyzed Logical Plan ==
eventID: string, type: string, exchange: string, eventTimestamp: bigint, sequenceNumber: bigint, optionID: string, orderID: string, side: string, routingFirm: string, routedOrderID: string, session: string, price: decimal(18,8), quantity: bigint, timeInForce: string, handlingInstructions: string, orderAttributes: string, isGloballyUnique: boolean, originalOrderID: string, initiator: string, leavesQty: bigint, symbol: string, routedOriginalOrderID: string, displayQty: bigint, orderType: string, ... 82 more fields
Project [eventID#797974, type#797975, exchange#797976, eventTimestamp#797977L, sequenceNumber#...
scala>

scala> val stats = exec.optimizedPlan.stats
stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = Statistics(sizeInBytes=1.0 B, hints=none)
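
To see at which node the row count is missing, the statistics can be printed for every node of the optimized plan rather than only for the root. The following is a rough sketch; `RowCountSketch.describe` is a hypothetical helper, and it relies only on `Statistics.rowCount` being an `Option[BigInt]`.

```scala
object RowCountSketch {
  // rowCount is optional in Spark's Statistics; None means no row estimate is available.
  def describe(sizeInBytes: BigInt, rowCount: Option[BigInt]): String =
    rowCount match {
      case Some(n) => s"sizeInBytes=$sizeInBytes, rowCount=$n"
      case None    => s"sizeInBytes=$sizeInBytes, rowCount unavailable"
    }
}

// In spark-shell, with `exec` defined as in the transcript above:
//   exec.optimizedPlan.foreach { node =>
//     val s = node.stats
//     println(node.nodeName + ": " + RowCountSketch.describe(s.sizeInBytes, s.rowCount))
//   }
```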

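Am I missing a step here? How can I get the row count for the query?

Spark version: 2.3.0. The files in the table are in Parquet format.

Update: I can get the statistics for CSV files. I cannot get the same for Parquet files.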

The difference between the execution plans for Parquet and CSV is the relation node: for CSV we get a HiveTableRelation, while for Parquet we get a Relation.
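
A small sketch of how the two cases might be told apart programmatically. A `Relation[...]` entry in the plan output corresponds to the `LogicalRelation` class, which Spark uses when it reads a Hive metastore Parquet table with its built-in Parquet reader (the default, controlled by `spark.sql.hive.convertMetastoreParquet`). The helper `RelationKind.explain` is hypothetical, and turning the conversion off is only an experiment to compare statistics, not a confirmed fix.

```scala
object RelationKind {
  // Map a leaf node's class name to how the table was resolved.
  // Hypothetical helper for illustration only.
  def explain(nodeName: String): String = nodeName match {
    case "HiveTableRelation" => "resolved as a Hive table (Hive SerDe path)"
    case "LogicalRelation"   => "resolved as a Spark data source relation"
    case other               => s"other leaf node: $other"
  }
}

// In spark-shell, with `query` defined as above:
//   spark.sql(query).queryExecution.optimizedPlan
//     .collectLeaves().map(_.nodeName).map(RelationKind.explain).foreach(println)
//
//   // Experiment (assumption): keep the Hive relation for Parquet and compare the stats.
//   spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
//   println(spark.sql(query).queryExecution.optimizedPlan.stats)
```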

Any ideas?

0 Answers:

No answers