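I have many window functions over a large table, and I was hoping to optimize them by storing the data bucketed on the PARTITION BY column. The column has too many distinct values to be used as a partition key, so I want to bucket by it instead.
When I make this change, there is no difference in the EXPLAIN plan (other than the table name).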
create table basic (id bigint, time timestamp) stored as orc;
create table bucketed (id bigint, time timestamp)
clustered by (id) into 200 buckets stored as orc;
insert into table basic select id, time from my_other_table; -- contains 63 files, unbucketed
insert into table bucketed select id, time from my_other_table; -- contains 200 files, bucketed
explain select id, row_number() over (partition by id order by time) from basic;
> Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_6]
Select Operator [SEL_4] (rows=468811575 width=48)
Output:["_col0","_col1"]
PTF Operator [PTF_3] (rows=468811575 width=48)
Function definitions:[{},{"name:":"windowingtablefunction",
"order by:":"_col1 ASC NULLS FIRST","partition by:":"_col0"}]
Select Operator [SEL_2] (rows=468811575 width=48)
Output:["_col0","_col1"]
<-Map 1 [SIMPLE_EDGE]
SHUFFLE [RS_1]
PartitionCols:id
TableScan [TS_0] (rows=468811575 width=48)
bucket@basic,basic,Tbl:COMPLETE,Col:NONE,Output:["id","time"]
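For completeness, the comparison I am describing is roughly the following. This is only a sketch: the `describe formatted` call is an extra sanity check I am assuming is useful here, to confirm that the table metadata actually records the bucketing (the Num Buckets and Bucket Columns fields), and the second statement is the same query as above pointed at the bucketed table.

```sql
-- Sanity check (assumption: worth verifying the metadata first):
-- look at the "Num Buckets" and "Bucket Columns" fields in the output.
describe formatted bucketed;

-- Same window query as above, but against the bucketed table.
-- Apart from the table name, the plan I get matches the one shown for basic.
explain select id, row_number() over (partition by id order by time) from bucketed;
```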
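Do I need to change any Hive configuration values to take advantage of the bucketing? Are there other table properties I need to set as well?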