Question

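I want to join two tables that share a common column, have the same number of buckets, and are sorted the same way.

Apart from setting the properties below, do I need to satisfy any other conditions?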

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

Answer 1

If both datasets are too large for a map-side join, an efficient way to join them is to sort both datasets into buckets.

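The trick is to cluster and sort both tables by the same join key:

CREATE TABLE `order` (cid int, price float, quantity int) CLUSTERED BY (cid) SORTED BY (cid) INTO 32 BUCKETS;

CREATE TABLE customer (id int, first string, last string) CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS;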
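A minimal sketch of how two such tables might be populated and joined. The staging tables `order_staging` and `customer_staging` are hypothetical and are assumed to have the same column layout as their targets. The two `hive.enforce.*` properties are only needed on Hive releases before 2.0 (later releases always enforce bucketing and sorting). `hive.auto.convert.sortmerge.join` is an extra setting not mentioned in the question. The table name `order` and the columns `first` and `last` are quoted defensively because they collide with HiveQL keywords.

```sql
-- Assumption: Hive 1.x. On Hive 2.0+ these two properties were removed
-- and bucketing/sorting are always enforced on insert.
SET hive.enforce.bucketing=true;
SET hive.enforce.sorting=true;

-- Both tables are bucketed AND sorted on the join key, with the same bucket count.
CREATE TABLE IF NOT EXISTS `order` (cid INT, price FLOAT, quantity INT)
CLUSTERED BY (cid) SORTED BY (cid ASC) INTO 32 BUCKETS;

CREATE TABLE IF NOT EXISTS customer (id INT, `first` STRING, `last` STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 32 BUCKETS;

-- Populate with INSERT ... SELECT so that Hive writes the bucket files itself.
-- order_staging and customer_staging are hypothetical source tables.
INSERT OVERWRITE TABLE `order` SELECT * FROM order_staging;
INSERT OVERWRITE TABLE customer SELECT * FROM customer_staging;

-- The three settings from the question.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
-- Assumption: also let Hive convert eligible joins automatically.
SET hive.auto.convert.sortmerge.join=true;

-- The join condition must use the bucketing/sorting columns of both tables.
SELECT c.id, o.price, o.quantity
FROM customer c
JOIN `order` o ON c.id = o.cid;
```

Files copied into a bucketed table's directory by other means (for example `LOAD DATA`) are not re-bucketed or re-sorted, which is why the sketch loads through `INSERT ... SELECT`.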

This provides two main optimization benefits:

Sorting by the join key makes the join easy: all possible matching values sit in the same area on disk.

Hash bucketing on the join key ensures that all matching values reside on the same node, so an equi-join can run with no shuffle.
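One way to check whether Hive actually takes advantage of these two properties is to inspect the table metadata and the query plan. This is a sketch that assumes the `order` and `customer` tables from this answer already exist. The exact plan text varies between Hive versions, but a sort-merge bucket join typically shows up as a "Sorted Merge Bucket Map Join Operator" rather than a join in a reduce stage.

```sql
-- Check the storage metadata: look for "Num Buckets", "Bucket Columns"
-- and "Sort Columns" in the output of each statement.
DESCRIBE FORMATTED customer;
DESCRIBE FORMATTED `order`;

-- Enable the sort-merge bucket join path for this session.
SET hive.auto.convert.sortmerge.join=true;
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Print the plan without running the query.
EXPLAIN
SELECT c.id, o.price, o.quantity
FROM customer c
JOIN `order` o ON c.id = o.cid;
```

If the plan still contains an ordinary shuffle join, the usual suspects are missing sort columns in the metadata, mismatched bucket counts, or a join condition that does not use the bucketing columns.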

How do I implement a Sort Merge Bucketing Map Join?

1 answer: