Hive optimization of GROUP BY with DISTINCT on multiple columns

Time: 2019-05-31 05:39:29

Tags: hadoop optimization hive mapreduce hiveql

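I am optimizing Hive (1.4-cdh) code that runs on MapReduce. In my project we use a lot of COUNT(DISTINCT ...) operations together with GROUP BY clauses; a sample HQL query is shown below.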

DROP TABLE IF EXISTS testdb.NewTable PURGE;
CREATE TABLE testdb.NewTable AS
SELECT a.* FROM (
    SELECT col1,
           COUNT(DISTINCT col2) AS col2,
           COUNT(DISTINCT col3) AS col3,
           COUNT(DISTINCT col4) AS col4,
           COUNT(DISTINCT col5) AS col5
    FROM BaseTable
    GROUP BY col1) a
WHERE a.col3 > 1 OR a.col4 > 1 OR a.col2 > 1 OR a.col5 > 1;

Could you suggest a better approach to reduce the processing time of this query?

1 answer:

Answer 0: (score: 1)

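Try using collect_set. It collects the distinct values and excludes NULLs, so size(collect_set(col)) gives the same number as COUNT(DISTINCT col).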

CREATE TABLE testdb.NewTable AS
SELECT a.* FROM (
    SELECT col1,
           size(collect_set(col2)) AS col2,
           size(collect_set(col3)) AS col3,
           size(collect_set(col4)) AS col4,
           size(collect_set(col5)) AS col5
    FROM BaseTable
    GROUP BY col1) a
WHERE a.col3 > 1 OR a.col4 > 1 OR a.col2 > 1 OR a.col5 > 1;
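
Before swapping the rewrite into a production job, it may be worth checking on a small sample that the two expressions agree, including for groups that contain NULLs. The query below is only a sketch: the inline rows are made-up stand-ins for BaseTable, only col1 and col2 are modelled, and the aliases distinct_count and set_size are illustrative names that do not come from the question. It assumes a Hive version that allows SELECT without a FROM clause.

```sql
-- Hypothetical sample rows standing in for BaseTable (not real data).
-- Group 'a' holds a duplicate value and a NULL; group 'b' holds only a NULL.
SELECT col1,
       COUNT(DISTINCT col2)    AS distinct_count,  -- original expression
       size(collect_set(col2)) AS set_size         -- suggested replacement
FROM (
    SELECT 'a' AS col1, CAST(1 AS INT) AS col2
    UNION ALL
    SELECT 'a' AS col1, CAST(1 AS INT) AS col2
    UNION ALL
    SELECT 'a' AS col1, CAST(2 AS INT) AS col2
    UNION ALL
    SELECT 'a' AS col1, CAST(NULL AS INT) AS col2
    UNION ALL
    SELECT 'b' AS col1, CAST(NULL AS INT) AS col2
) t
GROUP BY col1;
```

Both columns should match in every group, since each expression ignores NULLs and counts duplicates once. One thing to keep in mind is that collect_set builds the full set of distinct values for each group in memory, so whether it actually runs faster than COUNT(DISTINCT) depends on how many distinct values each col1 group has; it is worth timing both versions on the real data.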