Preparing a dynamic query that uses TABLESAMPLE to fetch data from a specific bucket of a bucketed Hive table

Time: 2018-12-24 10:47:32

Tags: hive bucket

I created a bucketed Hive table:

CREATE TABLE IF NOT EXISTS udb.emp_bucket_table (
          emp_id            SMALLINT
         ,emp_city          VARCHAR(10)
         ,emp_salary        BIGINT
         ,emp_joining_date  DATE
)
CLUSTERED BY (emp_id) SORTED BY (emp_id ASC) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS ORC;

I then loaded the bucketed table from another table:

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

INSERT INTO TABLE udb.emp_bucket_table 
SELECT 
 emp_id
,emp_city
,emp_salary
,emp_joining_date
FROM udb.emp_table;

The emp_bucket_table table loaded all 10 records correctly, as shown below, and each record went into a specific bucket according to hash_function(bucketing_column) mod num_buckets (in our example, bucketing_column = emp_id and num_buckets = 5).

INFO  : Table udb.emp_bucket_table stats: [numFiles=5, numRows=10, totalSize=2553, rawDataSize=1584]

+---------+-----------+-------------+-------------------+--+
| emp_id  | emp_city  | emp_salary  | emp_joining_date  |
+---------+-----------+-------------+-------------------+--+
| 1       | NOIDA     | 10000       | 2018-12-06        |
| 2       | GURGAON   | 50000       | 2018-12-06        |
| 3       | DWARKA    | 92000       | 2018-12-06        |
| 4       | HARYANA   | 55000       | 2017-11-26        |
| 5       | NOIDA     | 5000        | 2017-02-28        |
| 6       | NOIDA     | 80000       | 2016-04-23        |
| 7       | GURGAON   | 8000        | 2018-05-12        |
| 8       | GURGAON   | 80000       | 2018-07-30        |
| 9       | GURGAON   | 80000       | 2018-07-30        |
| 10      | NOIDA     | 70000       | 2016-05-30        |
+---------+-----------+-------------+-------------------+--+

Running the query below shows which bucket each record went into.

select emp_id,(emp_id % 5)+1 as bucket_key from udb.emp_bucket_table;

+---------+-------------+--+
| emp_id  | bucket_key  |
+---------+-------------+--+
| 5       | 1           |
| 10      | 1           |
| 1       | 2           |
| 6       | 2           |
| 2       | 3           |
| 7       | 3           |
| 3       | 4           |
| 8       | 4           |
| 4       | 5           |
| 9       | 5           |
+---------+-------------+--+
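The bucket_key column above follows from the hash_function(bucketing_column) mod num_buckets rule. As a rough illustration, here is a minimal Python sketch of that rule. It assumes the legacy Hive bucketing scheme, in which the hash of an integer column is the integer itself and the hash is masked to a non-negative value before the modulo; other Hive versions may hash differently, so the numbers are illustrative only. The name bucket_number and the 1-based numbering (chosen to match the bucket numbers used by TABLESAMPLE) are choices made for this sketch, not part of Hive.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def bucket_number(emp_id: int, num_buckets: int = 5) -> int:
    """Return the 1-based bucket for an integer key.

    Assumption: hash(emp_id) == emp_id (legacy integer hashing).
    """
    return ((emp_id & INT_MAX) % num_buckets) + 1


# Print the emp_id -> bucket mapping for the ten sample rows.
for emp_id in range(1, 11):
    print(emp_id, bucket_number(emp_id))
```

Under that assumption, the sketch gives the same emp_id to bucket_key pairs as the query result above.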

If we sample bucket 3, we can fetch the records of that specific bucket from the Hive table without scanning the whole table.

select * from udb.emp_bucket_table TABLESAMPLE(BUCKET 3 OUT OF 5);
+---------+-----------+-------------+-------------------+--+
| emp_id  | emp_city  | emp_salary  | emp_joining_date  |
+---------+-----------+-------------+-------------------+--+
| 2       | GURGAON   | 50000       | 2018-12-06        |
| 7       | GURGAON   | 8000        | 2018-05-12        |
+---------+-----------+-------------+-------------------+--+
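To see why bucket 3 returns emp_id 2 and 7, the selection can be mimicked outside Hive. The sketch below is a simulation under stated assumptions, not Hive's implementation: it assumes an integer key hashes to itself and that TABLESAMPLE(BUCKET x OUT OF y) keeps the rows whose hash modulo y equals x - 1. The helper name tablesample_bucket is hypothetical.

```python
def tablesample_bucket(keys, x, y):
    """Return the keys that BUCKET x OUT OF y would keep.

    Assumptions: hash(key) == key, and buckets are numbered from 1.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [k for k in keys if (k & 0x7FFFFFFF) % y == x - 1]


emp_ids = list(range(1, 11))  # the ten emp_id values from the sample data
print(tablesample_bucket(emp_ids, 3, 5))  # [2, 7]
```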

I am trying to run the statement below because I want to store my bucket_key in a Hive variable.

set hivevar:bucketkeyvar=(select (emp_id % 5) +1 as bucket_key from udb.emp_bucket_table where emp_id =1);

My query below fails when I pass bucketkeyvar:

select * from udb.emp_bucket_table TABLESAMPLE(BUCKET '${hivevar:bucketkeyvar}' OUT OF 5);

Error: Error while compiling statement: FAILED: ParseException line 1:35 cannot recognize input near 'TABLESAMPLE' '(' 'BUCKET' in table source (state=42000,code=40000)

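What I actually want is to prepare a dynamic search query that computes the hash of the search column value and fetches records from the matching bucket.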
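One possible direction, offered as a sketch rather than a confirmed fix: Hive variables are substituted as plain text, so a set hivevar statement whose value is a subquery stores the query text rather than its result, and the BUCKET clause appears to expect an unquoted integer literal. Computing the bucket number outside Hive and generating the final statement sidesteps both issues. The function name build_bucket_query is hypothetical, and the bucket formula assumes that an integer emp_id hashes to itself.

```python
NUM_BUCKETS = 5
TABLE = "udb.emp_bucket_table"


def build_bucket_query(emp_id: int,
                       num_buckets: int = NUM_BUCKETS,
                       table: str = TABLE) -> str:
    """Build a TABLESAMPLE query for the bucket that emp_id maps to.

    Assumption: hash(emp_id) == emp_id, so bucket = emp_id % num_buckets + 1.
    """
    # int() rejects non-numeric input before it reaches the query text.
    bucket = (int(emp_id) & 0x7FFFFFFF) % num_buckets + 1
    return f"SELECT * FROM {table} TABLESAMPLE(BUCKET {bucket} OUT OF {num_buckets})"


# The bucket number is emitted as a bare integer, without quotes.
print(build_bucket_query(1))
```

The generated string could then be submitted from a wrapper script, for example with hive -e, or the computed number could be passed in with --hivevar and referenced without quotes inside the query.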

0 answers:

No answers yet