Question

I have three tables: A, B and C. A has 1 billion records, B has 10 million records, and C has 5 million records. My query looks like this:

select * from tableA a left outer join tableB b on a.id=b.id left outer join tableC c on b.id=c.id;

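After the first join, I will have more than 990 million rows in which b.id is NULL. The second join, against table C, then has to process all 990 million of those NULL b.id rows, which keeps a single reducer loaded for a very long time. Is there a way to skip rows whose join column is NULL?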

Answer 1

Add a b.id is not null condition to the ON clause. Depending on your Hive version, this may help:

select * 
   from tableA a 
       left outer join tableB b on a.id=b.id 
       left outer join tableC c on b.id=c.id and b.id is not null;

But as far as I know, this has not been a problem since version 0.14.

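Alternatively, you can split the rows into NULL and non-NULL sets and join only the non-NULL ones. The first query selects only the rows where the B id is NULL and adds NULL as col for each column that would come from table C. Then UNION ALL it with a second query that selects all the non-NULL rows and joins them to C: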

with a as(
select a.*, b.* 
   from tableA a 
       left outer join tableB b on a.id=b.id
)

select a.*, null as c_col1 --add all other columns(from c) as null to get same schema
   from a where a.b_id_col is null
UNION ALL
select a.*, c.*
   from a
       left outer join tableC c on a.b_id_col=c.id
   where a.b_id_col is not null
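
The query above uses b_id_col as a placeholder for the id column that came from table B. As a sketch of the same split with explicit column names, assume a hypothetical schema tableA(id, a_val), tableB(id, b_val), tableC(id, c_val) with bigint ids and string values. These column names and types are illustrative assumptions, not part of the original question:

```sql
-- Hypothetical schema: tableA(id, a_val), tableB(id, b_val), tableC(id, c_val)
with ab as (
    select a.id    as a_id,
           a.a_val as a_val,
           b.id    as b_id,   -- aliased so it does not clash with a.id
           b.b_val as b_val
      from tableA a
           left outer join tableB b on a.id = b.id
)
-- Rows with no match in tableB: they cannot match tableC either,
-- so emit typed NULLs instead of joining.
select ab.a_id, ab.a_val, ab.b_id, ab.b_val,
       cast(null as bigint) as c_id,
       cast(null as string) as c_val
  from ab
 where ab.b_id is null
union all
-- Rows that matched tableB: only these take part in the join to tableC.
select ab.a_id, ab.a_val, ab.b_id, ab.b_val,
       c.id    as c_id,
       c.c_val as c_val
  from ab
       left outer join tableC c on ab.b_id = c.id
 where ab.b_id is not null;
```

Both branches must list the same number of columns, in the same order and with compatible types, which is why the NULL placeholders are cast explicitly.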

Answer 2

We use rand() in place of NULL, so our join condition becomes

coalesce(b.id, rand()) = c.id

That way the NULL values are spread out on their own, but I wonder why the skewjoin setting did not help (we tried coalesce(b.id, 'SomeString') = c.id with skewjoin enabled).
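
One difference between the two attempts is worth noting: a constant placeholder still gives every NULL row the same join key, so all of them land on one key and everything depends on skew-join handling, while rand() gives each row a different key. Below is a minimal sketch of a variant that keeps the random key from ever matching a real id. It assumes b.id and c.id are integer columns, and the 'null_' prefix is a hypothetical marker chosen for illustration:

```sql
-- Assumption: ids are integers, so no real c.id can start with 'null_'.
select *
  from tableA a
       left outer join tableB b on a.id = b.id
       left outer join tableC c
         -- NULL b.id becomes a unique string such as 'null_0.7312...',
         -- which spreads those rows across reducers and never matches c.id
         on coalesce(cast(b.id as string),
                     concat('null_', cast(rand() as string)))
            = cast(c.id as string);
```

The salted rows still find no match in tableC, so the result is the same as for the plain left outer join; only the distribution of keys across reducers changes.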

Ignore rows with NULL join columns in a Hive query
