I have created a Hive bucketed table:
CREATE TABLE IF NOT EXISTS udb.emp_bucket_table (
emp_id SMALLINT
,emp_city VARCHAR(10)
,emp_salary BIGINT
,emp_joining_date DATE
)
CLUSTERED BY (emp_id) SORTED BY (emp_id ASC) INTO 5 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS ORC;
Then I loaded this bucketed table from another table as follows:
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
INSERT INTO TABLE udb.emp_bucket_table
SELECT
emp_id
,emp_city
,emp_salary
,emp_joining_date
FROM udb.emp_table;
The emp_bucket_table table loaded perfectly with 10 records, as shown below, and each record was placed into a specific bucket according to hash_function(bucketing_column) mod num_buckets (in our case, bucketing_column = emp_id and num_buckets = 5).
INFO : Table udb.emp_bucket_table stats: [numFiles=5, numRows=10, totalSize=2553, rawDataSize=1584]
+---------+-----------+-------------+-------------------+--+
| emp_id | emp_city | emp_salary | emp_joining_date |
+---------+-----------+-------------+-------------------+--+
| 1 | NOIDA | 10000 | 2018-12-06 |
| 2 | GURGAON | 50000 | 2018-12-06 |
| 3 | DWARKA | 92000 | 2018-12-06 |
| 4 | HARYANA | 55000 | 2017-11-26 |
| 5 | NOIDA | 5000 | 2017-02-28 |
| 6 | NOIDA | 80000 | 2016-04-23 |
| 7 | GURGAON | 8000 | 2018-05-12 |
| 8 | GURGAON | 80000 | 2018-07-30 |
| 9 | GURGAON | 80000 | 2018-07-30 |
| 10 | NOIDA | 70000 | 2016-05-30 |
+---------+-----------+-------------+-------------------+--+
Now, if we run the query below, we can find out which record went into which bucket.
select emp_id,(emp_id % 5)+1 as bucket_key from udb.emp_bucket_table;
+---------+-------------+--+
| emp_id | bucket_key |
+---------+-------------+--+
| 5 | 1 |
| 10 | 1 |
| 1 | 2 |
| 6 | 2 |
| 2 | 3 |
| 7 | 3 |
| 3 | 4 |
| 8 | 4 |
| 4 | 5 |
| 9 | 5 |
+---------+-------------+--+
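The mapping above can be reproduced outside Hive. This is only a minimal sketch: it assumes that the hash of an integer bucketing column is the integer itself, and that TABLESAMPLE numbers buckets from 1 (hence the +1, as in the query above). The name bucket_key is a hypothetical helper, not part of Hive.

```python
NUM_BUCKETS = 5  # matches "INTO 5 BUCKETS" in the table definition


def bucket_key(emp_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the 1-based bucket number for an integer key.

    Assumption: the hash of an integer key is the key itself,
    so the bucket is simply (key mod num_buckets) + 1.
    """
    return (emp_id % num_buckets) + 1


# Bucket number for each of the ten emp_id values in the table.
mapping = {emp_id: bucket_key(emp_id) for emp_id in range(1, 11)}

for emp_id, key in sorted(mapping.items(), key=lambda kv: (kv[1], kv[0])):
    print(emp_id, key)
```

Sorting by bucket number first lists the rows in the same order as the query output above.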
If we sample bucket '3', we can fetch that bucket's records from the Hive table without scanning the whole table.
select * from udb.emp_bucket_table TABLESAMPLE(BUCKET 3 OUT OF 5);
+---------+-----------+-------------+-------------------+--+
| emp_id | emp_city | emp_salary | emp_joining_date |
+---------+-----------+-------------+-------------------+--+
| 2 | GURGAON | 50000 | 2018-12-06 |
| 7 | GURGAON | 8000 | 2018-05-12 |
+---------+-----------+-------------+-------------------+--+
I am trying to execute the query below and want to store my bucket_key in a Hive variable.
set hivevar:bucketkeyvar=(select (emp_id % 5) +1 as bucket_key from udb.emp_table_bucket where emp_id =1);
My query below fails when I pass bucketkeyvar:
select * from udb.emp_table_bucket TABLESAMPLE(BUCKET '${hivevar:bucketkeyvar}' OUT OF 5);
Error: Error while compiling statement: FAILED: ParseException line 1:35 cannot recognize input near 'TABLESAMPLE' '(' 'BUCKET' in table source (state=42000,code=40000)
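Actually, I want to prepare a dynamic search query that computes the hash value of the search column and fetches records from that specific bucket.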
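To make the goal concrete, here is a rough sketch of the kind of query generation I have in mind, done on the client side. As far as I understand, Hive variable substitution is purely textual, so set hivevar:... stores the text of the subquery rather than its result, and the quoted '${hivevar:bucketkeyvar}' then appears where TABLESAMPLE expects a plain number. The function build_bucket_query and its parameters are hypothetical names for illustration, the table name is the one from the CREATE TABLE statement above, and the hash of an integer key is assumed to be the key itself.

```python
def build_bucket_query(emp_id: int,
                       table: str = "udb.emp_bucket_table",
                       num_buckets: int = 5) -> str:
    """Build a TABLESAMPLE query for the bucket that emp_id should fall into.

    Assumption: the hash of an integer key is the key itself, and
    TABLESAMPLE buckets are numbered starting at 1.
    """
    bucket = (int(emp_id) % num_buckets) + 1
    # The bucket number is written as a bare integer, not a quoted string.
    return (
        f"select * from {table} "
        f"TABLESAMPLE(BUCKET {bucket} OUT OF {num_buckets});"
    )


query = build_bucket_query(2)
print(query)
```

For emp_id = 2 this produces the same statement as the hand-written bucket 3 query shown earlier; the generated string would then be submitted through whichever Hive client is in use.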