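I have Spark 1.6.2 code that uses SQL/HQL. I would really like to know whether my job is performing partition pruning. The data is partitioned by date (the cdate field). The explain plan is: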
== Physical Plan ==
Project [coalesce(cdate#74,cdate#38) AS cdate#29,coalesce(account_key#75,account_key#34) AS account_key#30,coalesce(product#76,product#35) AS product#31,(coalesce(amount#77,0.0) + coalesce(amount#36,0.0)) AS amount#32,(coalesce(volume#78L,0) + cast(coalesce(volume#37,0) as bigint)) AS volume#33L]
+- SortMergeOuterJoin [account_key#34,cdate#38,product#35], [account_key#75,cdate#74,product#76], FullOuter, None
   :- Sort [account_key#34 ASC,cdate#38 ASC,product#35 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(account_key#34,cdate#38,product#35,200), None
   :     +- Project [volume#37,product#35,cdate#38,account_key#34,amount#36]
   :        +- BroadcastHashJoin [cdate#38], [cdate#24], BuildLeft
   :           :- Scan ParquetRelation[account_key#34,product#35,amount#36,volume#37,cdate#38] InputPaths: hdfs://hdp1.voicelab.local:8020/apps/hive/warehouse/my.db/daily_profiles
   :           +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
   :              +- TungstenExchange hashpartitioning(cdate#24,200), None
   :                 +- TungstenAggregate(key=[cdate#24], functions=[], output=[cdate#24])
   :                    +- Project [cdate#24]
   :                       +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#24])
   :                          +- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
   :                             +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[], output=[cdate#20,accountKey#21,product#22])
   :                                +- Project [cdate#20,accountKey#21,product#22]
   :                                   +- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
   +- Sort [account_key#75 ASC,cdate#74 ASC,product#76 ASC], false, 0
      +- TungstenExchange hashpartitioning(account_key#75,cdate#74,product#76,200), None
         +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Final,isDistinct=false),(count(1),mode=Final,isDistinct=false)], output=[cdate#74,account_key#75,product#76,amount#77,volume#78L])
            +- TungstenExchange hashpartitioning(cdate#20,accountKey#21,product#22,200), None
               +- TungstenAggregate(key=[cdate#20,accountKey#21,product#22], functions=[(sum(amount#23),mode=Partial,isDistinct=false),(count(1),mode=Partial,isDistinct=false)], output=[cdate#20,accountKey#21,product#22,sum#54,count#55L])
                  +- Scan ExistingRDD[cdate#20,accountKey#21,product#22,amount#23]
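How can I tell whether my job is using the Metastore for partition pruning?
Could you explain Scan ParquetRelation in detail? How do I know whether the scan uses partition pruning/discovery? What do the #SOME_NUMBER suffixes mean, e.g. account_key#34?
The use case is aggregating data by date, account, and product.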
Answer 0 (score: 1)
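Look for PartitionFilters: [...] in the physical plan. If the array is non-empty, partition pruning is being used; otherwise it is not. I could not find it in your plan, unless I missed it.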
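
If you want to automate that check, here is a minimal sketch in plain Python that scans the text of a physical plan for PartitionFilters: [...] entries and reports whether any of them is non-empty. It assumes you have already captured the plan as a string. The helper names and the "pruned" sample line are hypothetical and only for illustration; the exact plan format differs between Spark versions, so treat the regular expression as a starting point rather than a guarantee.

```python
import re

# Assumption: the plan prints partition filters as "PartitionFilters: [ ... ]"
# and the bracketed list itself contains no nested square brackets.
PARTITION_FILTERS = re.compile(r"PartitionFilters:\s*\[([^\]]*)\]")


def partition_filters(plan_text):
    """Return the contents of every PartitionFilters: [...] entry in the plan."""
    return [match.strip() for match in PARTITION_FILTERS.findall(plan_text)]


def uses_partition_pruning(plan_text):
    """True if at least one PartitionFilters array in the plan is non-empty."""
    return any(partition_filters(plan_text))


# Scan line copied from the plan in the question: no PartitionFilters entry.
unpruned = (
    "Scan ParquetRelation[account_key#34,product#35,amount#36,volume#37,cdate#38] "
    "InputPaths: hdfs://hdp1.voicelab.local:8020/apps/hive/warehouse/my.db/daily_profiles"
)

# Hypothetical line, made up to show what a non-empty array could look like.
pruned = "Scan parquet [cdate#38] PartitionFilters: [(cdate#38 = 2016-01-01)], PushedFilters: []"

if __name__ == "__main__":
    print(uses_partition_pruning(unpruned))
    print(uses_partition_pruning(pruned))
    print(partition_filters(pruned))
```

Run against the scan line from the question, the check finds no PartitionFilters entry at all, which matches what I saw when reading the plan by eye.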