Hive TABLESAMPLE on a clustered table

Asked: 2016-05-13 13:44:54

Tags: hadoop hive hiveql

I would like to ask about the correct way to use bucketing and TABLESAMPLE.

I created a table X with:
CREATE TABLE `X`(`action_id` string,`classifier` string)
CLUSTERED BY (action_id,classifier) INTO 256 BUCKETS
STORED AS ORC

Then I inserted 500M rows into X with:
set hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE X SELECT * FROM X_RAW
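
Before comparing query plans, it may help to confirm that the load really produced a bucketed table. The following is a minimal sketch, assuming the table name X from above; the exact layout of the output varies by Hive version, so it is not reproduced here.

```sql
-- Show the table metadata. The storage section should list the
-- bucket count (256) and the bucket columns (action_id, classifier)
-- declared in the CREATE TABLE statement.
DESCRIBE FORMATTED X;
```

If the metadata matches, the table directory would normally be expected to hold one file per bucket, which is consistent with the 256 map tasks reported in the results.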

Then I want to count or search for certain rows with a condition. Roughly:

SELECT COUNT(*) FROM X WHERE action_id='aaa' AND classifier='bbb'

But I should be better off using TABLESAMPLE, since X is clustered by (action_id, classifier). So a better query would be:

SELECT COUNT(*) FROM X 
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' AND classifier='bbb'

Is there anything wrong with the above? I cannot find any performance gain between the two queries.
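
One point to check first: BUCKET 1 OUT OF 256 reads one fixed bucket, while the rows matching the WHERE clause live in whichever bucket their key hashes to, which need not be bucket 1. The sketch below shows two ways to find that bucket. It relies on Hive's INPUT__FILE__NAME virtual column and the built-in hash() and pmod() functions. The second query rests on an assumption: that the table was written with Hive's legacy (Java-style) bucketing hash, which should be verified for the Hive version in use.

```sql
-- 1) Data-driven: list the file(s) that actually contain the key.
--    This still scans the table, but it shows which bucket file
--    holds the rows of interest.
SELECT INPUT__FILE__NAME, COUNT(*)
FROM X
WHERE action_id = 'aaa' AND classifier = 'bbb'
GROUP BY INPUT__FILE__NAME;

-- 2) Computed (assumes the legacy bucketing hash): pmod() gives a
--    0-based bucket index, while TABLESAMPLE numbers buckets from 1,
--    hence the + 1.
SELECT pmod(hash('aaa', 'bbb'), 256) + 1 AS tablesample_bucket;
```

If both queries point at the same bucket, that number, rather than a hard-coded 1, is the one to use in the TABLESAMPLE clause for this key.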

Query 1 and result (without TABLESAMPLE):

SELECT COUNT(*) FROM X
WHERE action_id='aaa' and classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.35 s    
--------------------------------------------------------------------------------
It scans the full data.

Query 2 and result:

SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON  action_id, classifier)
WHERE action_id='aaa' and classifier='bbb'

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED    256        256        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 15.82     s    
--------------------------------------------------------------------------------
It also scans the full data.
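
To see whether the sampling clause prunes any input files, the query plan can be inspected. The Hive documentation states that when the ON columns match the CLUSTERED BY columns, TABLESAMPLE should scan only the required buckets. The sketch below only shows how to look at the plan; the plan output differs between versions and execution engines, so none is shown.

```sql
-- Print the extended plan for the sampled query. The input paths
-- listed for the map stage indicate whether one bucket file or the
-- whole table directory is being read.
EXPLAIN EXTENDED
SELECT COUNT(*) FROM X
TABLESAMPLE(BUCKET 1 OUT OF 256 ON action_id, classifier)
WHERE action_id = 'aaa' AND classifier = 'bbb';
```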

The result I expected for query 2 is something like this
(use 1 map task and run noticeably faster than without TABLESAMPLE):
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 3.xx     s    
--------------------------------------------------------------------------------

The values of action_id and classifier are evenly distributed, with no skewed data.

So my question is: what is the correct query that targets only 1 bucket and uses only 1 map task?

0 answers:

No answers yet