我已经删除了列,即使在设置了所有参数之后,我也没有获得任何性能优势。 下面是我正在使用的查询和我创建的存储桶,我也添加了解释计划结果。
select count(*) from bigtable_main a inner join
big_cnt10000 b where a.srrecordid = b.srrecordid;
---112 seconds....
ALTER TABLE bigtable_main CLUSTERED BY(srrecordid) SORTED BY(srrecordid) INTO 40 BUCKETS ;
ALTER TABLE big_cnt10000 CLUSTERED BY(srrecordid) SORTED BY(srrecordid) INTO 40 BUCKETS ;
---112 seconds....
---------------------------------------------------
SET hive.enforce.bucketing=true;
SET hive.optimize.bucketmapjoin=true;
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
even the explain plan is same. Any idea?
Vertex dependency in root stage
Map 1 <- Map 3 (BROADCAST_EDGE)
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_13]
compressed:false
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [GBY_11]
| aggregations:["count(VALUE._col0)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_10]
sort order:
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
value expressions:_col0 (type: bigint)
Group By Operator [GBY_9]
aggregations:["count()"]
outputColumnNames:["_col0"]
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
Select Operator [SEL_8]
Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
Filter Operator [FIL_16]
predicate:(_col0 = _col11) (type: boolean)
Statistics:Num rows: 31669970 Data size: 3166997036 Basic stats: COMPLETE Column stats: NONE
Map Join Operator [MAPJOIN_19]
| condition map:[{"":"Inner Join 0 to 1"}]
| HybridGraceHashJoin:true
| keys:{"Map 3":"srrecordid (type: string)","Map 1":"srrecordid (type: string)"}
| outputColumnNames:["_col0","_col11"]
| Statistics:Num rows: 63339940 Data size: 6333994073 Basic stats: COMPLETE Column stats: NONE
|<-Map 3 [BROADCAST_EDGE]
| Reduce Output Operator [RS_5]
| key expressions:srrecordid (type: string)
| Map-reduce partition columns:srrecordid (type: string)
| sort order:+
| Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
| Filter Operator [FIL_18]
| predicate:srrecordid is not null (type: boolean)
| Statistics:Num rows: 42529 Data size: 4252905 Basic stats: COMPLETE Column stats: NONE
| TableScan [TS_1]
| alias:b
| Statistics:Num rows: 85058 Data size: 8505810 Basic stats: COMPLETE Column stats: NONE
|<-Filter Operator [FIL_17]
predicate:srrecordid is not null (type: boolean)
Statistics:Num rows: 57581763 Data size: 5758176306 Basic stats: COMPLETE Column stats: NONE
TableScan [TS_0]
alias:a
Statistics:Num rows: 115163525 Data size: 11516352512 Basic stats: COMPLETE Column stats: NONE
答案 0 :(得分:0)
Hive编译器需要元数据,Meta信息决定执行计划。 doc
编译器需要元数据,因此发送getMetaData请求并从MetaStore接收sendMetaData请求。
此元数据用于检查查询树中的表达式以及基于查询谓词修剪分区。编译器生成的计划是阶段的DAG,每个阶段是映射/减少作业,元数据操作或HDFS上的操作。对于map / reduce阶段,计划包含地图运算符树(在映射器上执行的运算符树)和reduce运算符树(用于需要reducer的操作
更改存储语句会更改表的物理存储属性,但不会更改元。
使用适当的铲斗并创建表格。
下面是详细信息的链接。