Combining two queries, one of which uses GROUP BY

Time: 2015-07-27 17:11:29

Tags: sql group-by hive aggregate-functions hiveql

I have two tables. TABLE1 has the columns:

pers_key
cost
visit

TABLE2 has the columns:

pers_key
months

First, I create a temporary table:

CREATE TABLE temp_table as
SELECT pers_key,SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;

Then I create TABLE3:

CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key

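I am wondering whether there is a better way to achieve the same result. Can this be done in a single query, without creating temp_table? Perhaps something like this: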

CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key

Or is the temporary table required to get the desired result set?

1 Answer:

Answer 0 (score: 2)

How about using a subquery?

SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key, SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key;
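
As a quick sanity check, the subquery join above can be run against a few rows. This is only a sketch: it uses Python's built-in sqlite3 module rather than Hive, the sample rows are made up, and `* 1.0` is added to ind2 because SQLite truncates integer division (Hive's `/` is documented as returning a double, so it should not be needed there).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    CREATE TABLE TABLE2 (pers_key INTEGER, months INTEGER);
    -- Made-up rows: person 1 has 3 cost rows spread over 2 distinct visits.
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (2, 5.0, 200);
    INSERT INTO TABLE2 VALUES (1, 2), (2, 5);
""")

# Same shape as the answer's query; "* 1.0" is a SQLite-only adjustment.
single_query = """
    SELECT A.pers_key,
           B.sum_cost / A.months AS ind1,
           B.visit_count * 1.0 / A.months AS ind2
    FROM TABLE2 A JOIN
         (SELECT pers_key, SUM(cost) AS sum_cost,
                 COUNT(DISTINCT visit) AS visit_count
          FROM TABLE1
          GROUP BY pers_key
         ) B
         ON A.pers_key = B.pers_key
    ORDER BY A.pers_key
"""
rows = conn.execute(single_query).fetchall()
print(rows)  # one row per pers_key: (pers_key, ind1, ind2)
```

Because the aggregation happens inside the subquery before the join, the result has exactly one row per pers_key, which is what temp_table provided in the two-step version.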

EDIT:

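Your question is a bit complicated. What you are doing is definitely a reasonable approach. Putting the subquery into a table and building an index on that table for the join might be faster. The red flag, however, is the count(distinct). In my experience with Hive, the following form of subquery B is faster than the one above: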

     (SELECT pers_key, SUM(sum_cost) AS sum_cost,
             COUNT(visit) AS visit_count
      FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
            FROM TABLE1
            GROUP BY pers_key, visit
           ) t
      GROUP BY pers_key
     ) B

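That this version is faster is a bit counter-intuitive (to me). What happens is that Hive readily parallelizes the group by, whereas the count(distinct) is processed serially. This sometimes happens in other databases too (I have seen similar behavior with count(distinct) in Postgres). One more caveat: I did not set up the Hive system on which I observed this, so it might be some sort of configuration issue.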
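
To check that the two-level GROUP BY returns the same numbers as COUNT(DISTINCT), both forms can be compared on a small table. The following is a sketch under assumptions: it runs in SQLite through Python's sqlite3 module, not in Hive, and the rows are made up. It says nothing about speed, only about equivalence of results. A NULL visit is included on purpose, since COUNT(DISTINCT visit) and COUNT(visit) both skip NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (pers_key INTEGER, cost REAL, visit INTEGER);
    -- Made-up rows, including a repeated visit and a NULL visit.
    INSERT INTO TABLE1 VALUES (1, 10.0, 100), (1, 20.0, 100), (1, 30.0, 101),
                              (2, 5.0, 200), (2, 7.0, NULL);
""")

# Original form: one aggregation with COUNT(DISTINCT).
count_distinct = """
    SELECT pers_key, SUM(cost) AS sum_cost,
           COUNT(DISTINCT visit) AS visit_count
    FROM TABLE1
    GROUP BY pers_key
    ORDER BY pers_key
"""

# Rewritten form: deduplicate (pers_key, visit) first, then count.
two_level = """
    SELECT pers_key, SUM(sum_cost) AS sum_cost,
           COUNT(visit) AS visit_count
    FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
          FROM TABLE1
          GROUP BY pers_key, visit
         ) t
    GROUP BY pers_key
    ORDER BY pers_key
"""

distinct_rows = conn.execute(count_distinct).fetchall()
two_level_rows = conn.execute(two_level).fetchall()
print(distinct_rows)
print(two_level_rows)
```

The inner GROUP BY leaves one row per (pers_key, visit) pair, so a plain COUNT in the outer query counts distinct visits, and summing the partial sums reproduces the total cost.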