Question

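I have a question about Hive map joins. I know that a map join is the better choice when a small table is joined to a large table, but what happens with SQL like this?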

select a.col1,
       a.col2,
       a.col3,
       /* many more columns from table a, omitted */
       b.col4,
       b.col5,
       b.col6
  from a
 inner join b
    on (a.id = b.id)
 where b.date = '2018-02-10'
   and b.hour = '10';

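Notes:
Table b is a large table: more than 100 million rows (10000W+).
Table a is a large table: more than 100 million rows (10000W+).
With the predicate applied, table b returns only 1,000 rows, so I expected this SQL to use a map join, but the execution plan shows the join happening on the reduce side.

Can anyone tell me why?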

Answer 1

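I'm not a Hive expert, but sometimes the tool used as the SQL client (e.g. MySQL Workbench) has an implicit limit of 1000 rows in its settings. Try specifying a limit yourself and forcing it to a value higher than 1000.

For example, see this image:

This is MySQL Workbench. Unless you specify a limit yourself, this limit is automatically added to your query.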

Answer 2

Try moving the where clause into a subquery:

select a.col1,
       a.col2,
       a.col3,
       /* many more columns from table a, omitted */
       b.col4,
       b.col5,
       b.col6
  from a
 inner join (select *
               from b
              where b.date = '2018-02-10'
                and b.hour = '10') b
    on a.id = b.id;

Alternatively, an intermediate filtered (temporary) table instead of a subquery will definitely work, though it is less efficient.
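
A minimal sketch of the intermediate-table variant, assuming Hive 0.14 or later (needed for temporary tables). The table name b_filtered is hypothetical and does not appear in the question; the column list is taken from the query above.

```sql
-- b_filtered is a hypothetical name; a temporary table exists only for the current session.
create temporary table b_filtered as
select b.id,
       b.col4,
       b.col5,
       b.col6
  from b
 where b.date = '2018-02-10'
   and b.hour = '10';

-- Join the large table a against the small, already materialized result.
select a.col1,
       a.col2,
       a.col3,
       f.col4,
       f.col5,
       f.col6
  from a
 inner join b_filtered f
    on a.id = f.id;
```

Because the filtered rows are written out first, the join input is small on disk as well as in row count, which is probably what makes this variant reliable. The cost is an extra write and read of the filtered data.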

Also check these Hive configuration parameters:

set hive.auto.convert.join=true; -- enables automatic conversion to map join
set hive.mapjoin.smalltable.filesize=25000000; -- size threshold in bytes for the small table

If the small table does not exceed the size specified by the hive.mapjoin.smalltable.filesize parameter, the join is converted to a map join.
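
One way to check whether the conversion actually happened is to run EXPLAIN with these settings in place. This is only a sketch: the plan text varies with the Hive version and execution engine, but a converted join typically appears as a Map Join Operator, while a reduce-side join appears as a Join Operator inside a reduce stage.

```sql
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- Print the plan without running the query, then look for "Map Join Operator".
explain
select a.col1,
       b.col4
  from a
 inner join (select *
               from b
              where b.date = '2018-02-10'
                and b.hour = '10') b
    on a.id = b.id;
```

If the plan still shows a reduce-side join, the size Hive estimates for the filtered b is likely above the threshold. That estimate appears to be based on input size, not on the number of rows that survive the predicate.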

Hive: a small query block joined to a big table, why can't it use a map join?
