I want to ask about the correct way to use bucketing and tablesample.
I created a table X with
CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC
Then I inserted 500M rows into X with
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
Then I want to count or search for some rows with a condition. Roughly:
SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'
But I should be able to do better with tablesample, since I clustered X by (action_id, classifier). So the better query would be
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'
Is anything wrong with the above? I can't find any performance gain between these two queries.
Query 1 and result (without tablesample):
SELECT COUNT(*) from X
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.35 s
--------------------------------------------------------------------------------
It scans full data.
Query 2 and result (with tablesample):
SELECT COUNT(*) from X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 256 256 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 15.82 s
--------------------------------------------------------------------------------
It ALSO scans full data.
The result I expected from query 2 is something like this
(uses 1 map task and is relatively faster than without tablesample):
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 .......... SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 3.xx s
--------------------------------------------------------------------------------
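The values of action_id and classifier are evenly distributed and there is no skewed data.
So my question is: what is the correct query to target only 1 bucket and use only 1 map task?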
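In case it helps with diagnosing this, here is a minimal sketch of how one might check which bucket file actually holds the matching rows and compare the plans. It relies on Hive's INPUT__FILE__NAME virtual column and on EXPLAIN EXTENDED. The assumption, which I have not verified on this table, is that each bucket was written as its own file, so a single distinct file name would be expected for one (action_id, classifier) pair.

```sql
-- Sketch only: list the file(s) that contain the matching rows.
-- Assumes one file per bucket under the table directory.
SELECT DISTINCT INPUT__FILE__NAME
FROM X
WHERE action_id='aaa' AND classifier='bbb';

-- Show the plan of the sampled query, to compare it with the plain query.
EXPLAIN EXTENDED
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb';
```

If the file reported by the first statement is not the one for bucket 1, then BUCKET 1 OUT OF 256 would presumably be sampling a bucket that does not contain these rows at all.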