Question

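I am trying to run a Hive query over a large amount of data. The geocode lookup table has an ip_from and ip_to range, and I have to compare it against a table of 1.8 million rows.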

Hive script:

select *
from ip_address a, ip_lookup b
where a.AddressInt >= b.ip_from and a.AddressInt <= b.ip_to;

On AWS EMR I ran a c3.xlarge cluster, and the job stayed stuck at 67% for more than a day. Here is the Hadoop job information for Stage-1:

Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product
Stage-1: number of mappers: 6; number of reducers: 1

What can I do to improve the performance of this Hive script?

Answer 1

To improve performance, bucket the larger table on the join field (the IP address in your case). For more information, see this page.
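A minimal sketch of what bucketing could look like, assuming ip_address is the larger table and AddressInt fits in a BIGINT. The table name ip_address_bucketed, the column type, and the bucket count of 32 are illustrative assumptions, not taken from the question:

```sql
-- Needed on Hive versions before 2.0 so that inserts honor the bucket
-- definition; from 2.0 onward bucketing is always enforced.
SET hive.enforce.bucketing = true;

-- Hypothetical bucketed copy of ip_address. Only AddressInt is known from
-- the question; add the table's remaining columns as needed.
CREATE TABLE ip_address_bucketed (
  AddressInt BIGINT
)
CLUSTERED BY (AddressInt) SORTED BY (AddressInt ASC) INTO 32 BUCKETS;

-- Populate the bucketed table from the original one.
INSERT OVERWRITE TABLE ip_address_bucketed
SELECT AddressInt
FROM ip_address;
```

The bucket count is a tuning choice. It is usually picked so that each bucket is a manageable size for one task, and 32 here is only a placeholder.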

You can also use the SMB (sort-merge-bucket) join, which was implemented by Facebook.
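As a rough sketch, these are the session settings commonly used to let Hive pick an SMB join. It assumes both tables are already bucketed and sorted on the join column with compatible bucket counts. Hive's bucketed join optimizations generally apply to equality joins on the bucketing column, so it is worth checking the plan with EXPLAIN to see whether the range predicate in the question actually benefits:

```sql
-- Let Hive convert eligible joins to sort-merge-bucket joins automatically.
SET hive.auto.convert.sortmerge.join = true;

-- Enable bucket map join and its sorted-merge variant.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

-- Read the input bucket by bucket so that matching buckets are joined together.
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
```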