Optimizing window functions with bucketing

Time: 2018-07-20 14:59:26

Tags: hive amazon-emr window-functions bucket apache-tez

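I have many window functions running over a large table, and I was hoping to optimize them by bucketing the data on the column used in PARTITION BY. That column has too many distinct values to serve as a partition key, so I want to bucket by it instead.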

When I make this change, the EXPLAIN plan shows no difference at all (other than the table name).

create table basic (id bigint, time timestamp) stored as orc;
create table bucketed (id bigint, time timestamp) 
       clustered by (id) into 200 buckets stored as orc;

insert into table basic select id, time from my_other_table;    -- contains 63 files, unbucketed
insert into table bucketed select id, time from my_other_table; -- contains 200 files, bucketed

explain select id, row_number() over (partition by id order by time) from basic;
>   Stage-0
      Fetch Operator
        limit:-1
        Stage-1
          Reducer 2
          File Output Operator [FS_6]
            Select Operator [SEL_4] (rows=468811575 width=48)
              Output:["_col0","_col1"]
              PTF Operator [PTF_3] (rows=468811575 width=48)
                Function definitions:[{},{"name:":"windowingtablefunction",
                    "order by:":"_col1 ASC NULLS FIRST","partition by:":"_col0"}]
                Select Operator [SEL_2] (rows=468811575 width=48)
                  Output:["_col0","_col1"]
                <-Map 1 [SIMPLE_EDGE]
                  SHUFFLE [RS_1]
                    PartitionCols:id
                    TableScan [TS_0] (rows=468811575 width=48)
                      bucket@basic,basic,Tbl:COMPLETE,Col:NONE,Output:["id","time"]
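
Before comparing plans, it may help to confirm that the metastore actually records the bucketing, and then run the same EXPLAIN against the bucketed table. This is only a sketch using the table names from above. The commented-out setting is an assumption: it should only matter on Hive releases older than 2.0, where bucketing was not enforced on insert by default, and it would have to be set before populating the table.

```sql
-- Assumption: only relevant on Hive < 2.0, and must be set before the INSERT.
-- set hive.enforce.bucketing=true;

-- The storage section of the output should report the bucket count (200)
-- and the bucket column (id) for this table.
describe formatted bucketed;

-- Same window query as above, this time against the bucketed table.
explain select id, row_number() over (partition by id order by time) from bucketed;
```

If the metadata looks right and the plan still contains the SHUFFLE [RS_1] edge with PartitionCols:id, that would suggest the bucketing is recorded but is not being used to avoid the shuffle for this query.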

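Do I need to change any Hive configuration values to take advantage of the buckets? Are there any other table properties I need to set?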

0 answers:

No answers yet