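I created an AWS EMR cluster, connected to the master node over SSH, started Hive, and created an external table over data in an AWS S3 bucket. However, some queries that I expected to run mapper or reducer jobs do not run any. For example, I expected the following query to run some mapper tasks, since it narrows the output down to two columns: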
SELECT item, store FROM tt3 LIMIT 10;
But it does not, and the result comes back almost immediately. The EXPLAIN command confirms this:
Stage-0 Fetch Operator
limit:10
Limit [LIM_2]
Number of rows:10
Select Operator [SEL_1]
Output:["_col0","_col1"]
TableScan [TS_0]
Output:["item","store"]
It behaves as I expected for the following query, which runs a MapReduce job first:
select count(*) from tt3;
Output of EXPLAIN SELECT COUNT(*) FROM tt3;
Vertex dependency in root stage
Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_6]
Group By Operator [GBY_4] (rows=1 width=8)
Output:["_col0"],aggregations:["count(VALUE._col0)"]
<-Map 1 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_3]
Group By Operator [GBY_2] (rows=1 width=8)
Output:["_col0"],aggregations:["count()"]
Select Operator [SEL_1] (rows=1 width=211312928)
TableScan [TS_0] (rows=1 width=211312928)
default@tt3,tt3,Tbl:COMPLETE,Col:COMPLETE
Answer 0 (score: 0)
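This is expected behavior in Hive.
In Hive, if you run a simple query such as select * from table, no MapReduce job is launched, because Hive only has to read the rows straight out of the table's files (on HDFS in this example) and return them.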
Hive# select * from foo;
+---------+-----------+----------+--+
| foo.id | foo.name | foo.age |
+---------+-----------+----------+--+
| 1 | a | 10 |
| 2 | a | 10 |
| 3 | b | 10 |
| 4 | c | 20 |
+---------+-----------+----------+--+
4 rows selected (0.116 seconds)
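As far as I know, this shortcut is governed by the hive.fetch.task.conversion property, which accepts none, minimal and more (I believe more is the default in recent Hive versions). The following is only a sketch for comparing the two plans, not something taken from the question: it reuses the asker's table tt3 and columns item and store, and it assumes the session is allowed to change this property.

```sql
-- Assumption: hive.fetch.task.conversion can be changed in this session.
-- With "none", even a plain projection should be compiled into a job
-- instead of a bare Fetch Operator.
SET hive.fetch.task.conversion=none;
EXPLAIN SELECT item, store FROM tt3 LIMIT 10;

-- With "more", simple SELECT / filter / LIMIT queries are eligible to run
-- as a fetch task again, with no mapper or reducer.
SET hive.fetch.task.conversion=more;
EXPLAIN SELECT item, store FROM tt3 LIMIT 10;
```

If the first EXPLAIN shows a map stage and the second shows only Stage-0 with a Fetch Operator, that would confirm the fast result in the question comes from this conversion rather than from a problem with the cluster.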
As soon as you perform an aggregation, a reducer phase is executed along with the map phase.
Hive# select count(*) from foo group by name;
INFO : Map 1: 0/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2
INFO : Map 1: 1/1 Reducer 2: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1
+------+--+
| _c0 |
+------+--+
| 2 |
| 1 |
| 1 |
+------+--+
3 rows selected (13.709 seconds)
We can add another reducer phase to the query above by adding an order by clause:
Hive# select count(*) cnt from foo group by name order by cnt;
INFO : Map 1: 0/1 Reducer 2: 0/2 Reducer 3: 0/1
INFO : Map 1: 0(+1)/1 Reducer 2: 0/2 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 0/1 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 0(+1)/1 Reducer 3: 0/1
INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 0(+1)/1
INFO : Map 1: 1/1 Reducer 2: 1/1 Reducer 3: 1/1
+------+--+
| cnt |
+------+--+
| 1 |
| 1 |
| 2 |
+------+--+
You can see that 2 reducer phases ran, because after aggregating we also sort the results:
Map 1: loads the data from HDFS.
Reducer 2: performs the aggregation.
Reducer 3: sorts the aggregated results in ascending order.
If you run explain on the query above, you get:
Hive# explain select count(*) cnt from foo group by name order by cnt;
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
See this link to get familiar with when Hive uses Map/Reduce jobs.