Question

I ran the same query with and without the SMB join and got different results. Please help explain why.

SET hive.enforce.bucketing=true;

create table dbaproceduresbuckets (
owner           string ,
object_name     string ,
procedure_name  string ,
object_id       double ,
subprogram_id   double ,
overload        string ,
object_type     string ,
aggregate       string ,
pipelined       string ,
impltypeowner   string ,
impltypename    string ,
parallel        string ,
interface       string ,
deterministic   string ,
authid          string )
CLUSTERED BY (object_id) SORTED BY (OBJECT_ID ASC) INTO 32 BUCKETS;

CREATE TABLE dbaobjectsbuckets1(
owner            string,
object_name      string,
subobject_name   string,
object_id        double,
data_object_id   double,
object_type      string,
created          string,
last_ddl_time    string,
timestamp        string,
status           string,
temporary        string,
generated        string,
secondary        string,
namespace        double,
edition_name     string) CLUSTERED BY (object_id) SORTED BY (OBJECT_ID ASC) INTO 32 BUCKETS;

**** load the table;

0: jdbc:hive2://xxxxxx:10000> select count(*) from dbaobjectsbuckets1 a, dbaproceduresbuckets b
0: jdbc:hive2://xxxxxxxx:10000> where a.object_id = b.object_id;
INFO  : Hadoop job information for Stage-2: number of mappers: 3; number of reducers: 1
INFO  : 2016-06-13 15:56:00,381 Stage-2 map = 0%,  reduce = 0%
INFO  : 2016-06-13 15:56:55,818 Stage-2 map = 1%,  reduce = 0%, Cumulative CPU 122.6 sec
INFO  : 2016-06-13 15:57:47,124 Stage-2 map = 7%,  reduce = 0%, Cumulative CPU 326.86 sec
..........
INFO  : 2016-06-13 16:05:01,246 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 867.1 sec
INFO  : MapReduce Total cumulative CPU time: 14 minutes 27 seconds 100 msec
INFO  : Ended Job = job_1464280256859_0146
+--------+--+
|  _c0   |
+--------+--+
| 54876  |
+--------+--+

**** 
set hive.auto.convert.sortmerge.join=true; 
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;   

0: jdbc:hive2://xxxxxxx:10000> select count(*) from   dbaobjectsbuckets1 a, dbaproceduresbuckets b

0: jdbc:hive2://xxxxxxxx:10000> where a.object_id = b.object_id;

in the execution plan, I am seeing

 **** but the result is showing
 INFO  : Hadoop job information for Stage-1: number of mappers: 32; number of reducers: 1
  ......
 INFO  : MapReduce Total cumulative CPU time: 4 minutes 8 seconds 490 msec

 INFO  : Ended Job = job_1464280256859_0150
+------+--+
| _c0  |
+------+--+
| 2    |
+------+--+

My question is: why is the count only 2 when I use the SMB join? It should be 54876.

Thanks!

Answer 1

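Use a SORT BY clause when inserting data into the sorted table

or

set hive.enforce.sorting=true

before inserting data into the sorted table.

The SMB join relies on the rows inside each bucket actually being sorted on the join key. Declaring SORTED BY in the table definition does not by itself sort data that was loaded without sorting, so if the buckets are unsorted the merge step can skip matching rows, which would explain the lower count.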
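As a rough illustration of the second option, the two bucketed tables could be repopulated with both enforcement settings turned on before the insert. This is only a sketch: `dbaobjects_staging` and `dbaprocedures_staging` are hypothetical unbucketed source tables that do not appear in the question, and `SELECT *` assumes their columns are in the same order as the target tables.

```sql
-- Assumption: dbaobjects_staging and dbaprocedures_staging are hypothetical
-- unbucketed tables holding the raw rows; substitute your own sources.

-- Make Hive honour the CLUSTERED BY / SORTED BY declarations on insert.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Rewrite both tables so every bucket file is sorted on object_id.
INSERT OVERWRITE TABLE dbaobjectsbuckets1
SELECT * FROM dbaobjects_staging;

INSERT OVERWRITE TABLE dbaproceduresbuckets
SELECT * FROM dbaprocedures_staging;
```

After reloading both tables this way, rerunning the count with the SMB join settings from the question should show whether the two results now agree.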
