Question

I want to find duplicates in a Hive table, such as the one shown below.

ID      name    phone
1       John    602-230-4040
2       Brian   602-230-3030
3       John    602-230-4040
4       Brian   602-230-3030
5       Jeff    602-230-4040

In a relational database, the simplest way is to use the count function with a group by and a having clause. When I run the following query,

select count(name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;

it throws the following exception:

FAILED: UDFArgumentException DISTINCT keyword must be specified

I then added the distinct keyword to the query.

select count(distinct name, phone) cnt, name, phone from mytest group by name, phone having cnt>1;

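Obviously this query returns no rows: with the distinct keyword, every (name, phone) group has a count of 1, so no duplicates show up in the result.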

I am not sure why Hive requires the distinct keyword with the count function when a group by clause is used.

Can anyone tell me how to find duplicates in a Hive table?

Answer 1

If I understand your use case correctly, you actually want COUNT(*), since you are interested in a plain row count.

SELECT name, phone, COUNT(*) AS cnt FROM mytest GROUP BY name, phone HAVING cnt > 1;

When I run this query against your test data (loaded here into a table named foo):

hive> SELECT id, name, phone FROM foo;
OK
1   John    602-230-4040
2   Brian   602-230-3030
3   John    602-230-4040
4   Brian   602-230-3030
5   Jeff    602-230-4040
Time taken: 0.32 seconds, Fetched: 5 row(s)
hive> SELECT name, phone, COUNT(*) AS cnt
    > FROM foo GROUP BY name, phone HAVING cnt > 1;
...
... Lots of MapReduce spam
...
Brian       602-230-3030    2
John        602-230-4040    2
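
The query above reports each duplicated (name, phone) pair with its count, but not which rows are involved. If the IDs of the duplicate rows are also needed, one possible approach is a window function. The following is only a sketch: it assumes the question's table mytest with columns id, name, and phone, and a Hive version with windowing support (0.11 or later). The alias t and the column name cnt are arbitrary.

```sql
-- Sketch: list every row that belongs to a duplicated (name, phone) group.
-- Assumes table mytest(id, name, phone) and Hive 0.11+ (windowing support).
SELECT id, name, phone
FROM (
  SELECT id, name, phone,
         -- number of rows sharing this row's (name, phone) pair
         COUNT(*) OVER (PARTITION BY name, phone) AS cnt
  FROM mytest
) t            -- Hive requires an alias for the subquery
WHERE cnt > 1;
```

On the sample data this should return the rows with IDs 1, 2, 3, and 4 (in no guaranteed order) and leave out row 5, since Jeff's (name, phone) pair occurs only once.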

