Question

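I have two tables with IDs. The smaller table contains the IDs that I want to exclude from the larger table. How can I do this in Hive?

If I have to use a transform, could you help me with a script?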

Answer 1

Suppose you have the following two tables:

Table a:
ID
1
2
3

Table b:
ID
2
3

Desired output table:
ID
1

The following query should work.

select a.id from a left outer join b on (a.id==b.id) where b.id is null
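
To try this against the sample data, here is a minimal sketch. It assumes Hive 0.14 or later for INSERT ... VALUES; the NOT EXISTS variant at the end assumes Hive 0.13 or later, where correlated subqueries in the WHERE clause are supported. The INT column type is an assumption; the table names a and b are taken from the example above.

```sql
-- Build the two sample tables from the example (column type INT is an assumption).
CREATE TABLE a (id INT);
CREATE TABLE b (id INT);
INSERT INTO TABLE a VALUES (1), (2), (3);
INSERT INTO TABLE b VALUES (2), (3);

-- Anti-join: every row of a is kept by the outer join; rows with no
-- match in b have b.id set to NULL, and the WHERE clause keeps only those.
SELECT a.id
FROM a
LEFT OUTER JOIN b ON (a.id = b.id)
WHERE b.id IS NULL;
-- returns the single row: 1

-- Equivalent formulation with a correlated subquery.
SELECT a.id
FROM a
WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id);
```

Both queries return the IDs in a that do not appear in b. The join form also works on older Hive versions that lack subquery support in WHERE.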

Answer 2

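If the small table is small enough to fit in memory, try the distributed_map UDF from Brickhouse (http://github.com/klout/brickhouse). Insert the keys into a local directory, then send that local directory to the distributed cache. You can then access the contents of the map from your Hive query.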

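-- Write the IDs to exclude into a local directory.
insert overwrite local directory 'exclude_ids' 
   select id, value from small_table;

-- Ship that directory to the distributed cache (the name must match the one above).
add file exclude_ids;

-- Keep only rows whose id is not a key in the map; NULL must be tested with IS NULL.
select * from big_table
where map_index( distributed_map( 'exclude_ids' ), id ) is null;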
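
If Brickhouse is not available, a roughly similar in-memory effect may be achievable with Hive's built-in map join, applied to the left outer join from Answer 1. This is a different technique from distributed_map, shown only as a sketch. It assumes small_table fits under the configured size threshold, and the threshold value shown is illustrative rather than a recommendation.

```sql
-- Let Hive convert the join to a map-side join when one side is small enough.
SET hive.auto.convert.join=true;
-- Size limit in bytes for the small table; the value here is illustrative.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Anti-join: keep rows of big_table whose id has no match in small_table.
SELECT bt.*
FROM big_table bt
LEFT OUTER JOIN small_table st ON (bt.id = st.id)
WHERE st.id IS NULL;
```

With these settings, Hive may load small_table into memory on each mapper and skip the shuffle phase. Whether it actually does so depends on the table size and the cluster configuration.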

If the small table is too large to fit in memory and you have access to an HBase server, try doing something similar with the hbase_cached_get UDF.
