I have two tables. TABLE1 has the columns:
pers_key
cost
visit
TABLE2 has the columns:
pers_key
months
First, I create a temporary table:
CREATE TABLE temp_table as
SELECT pers_key, SUM(cost) AS sum_cost, COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key;
Then I create TABLE3:
CREATE TABLE TABLE3 as
SELECT A.pers_key,
B.sum_cost/A.months AS ind1,
B.visit_count/A.months AS ind2
FROM TABLE2 AS A, temp_table AS B
WHERE A.pers_key = B.pers_key;
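I would like to know whether there is a better way to get the same result. Can this be done in a single query, without creating temp_table? Maybe something like this: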
CREATE TABLE TABLE3 as
SELECT A.pers_key,
(SUM(B.cost)over (partition by B.pers_key))/A.months AS ind1,
(COUNT(B.visit)over (partition by B.pers_key))/A.months AS ind2
FROM TABLE2 AS A, TABLE1 AS B
WHERE A.pers_key = B.pers_key
Or is the temporary table required to get the desired result set?
Answer 0 (score: 2)
How about using a subquery?
SELECT A.pers_key,
B.sum_cost / A.months AS ind1,
B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
(SELECT pers_key, SUM(cost) AS sum_cost,
COUNT(DISTINCT visit) AS visit_count
FROM TABLE1
GROUP BY pers_key
) B
ON A.pers_key = B.pers_key;
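To produce TABLE3 directly, as the question asks, the same query can be wrapped in a CREATE TABLE ... AS statement. This is only a minimal sketch reusing the table and column names from the question; it assumes the engine accepts a derived table inside CREATE TABLE ... AS, as the question's own statements suggest.

```sql
-- Builds TABLE3 in one statement; temp_table is no longer needed.
CREATE TABLE TABLE3 AS
SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key,
             SUM(cost) AS sum_cost,
             COUNT(DISTINCT visit) AS visit_count
      FROM TABLE1
      GROUP BY pers_key          -- one row per person before the join
     ) B
     ON A.pers_key = B.pers_key;
```

Because the subquery is aggregated to one row per pers_key before the join, the result has the same shape as the two-step version with temp_table.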
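EDIT:
Your question is a bit more complicated than that. The subquery is definitely a reasonable approach. Putting the subquery's result in a table and building an index on that table for the join might be faster. However, the red flag is the COUNT(DISTINCT). In my experience with Hive, the following is faster than the subquery above: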
(SELECT pers_key, SUM(sum_cost) AS sum_cost,
COUNT(visit) AS visit_count
FROM (SELECT pers_key, visit, SUM(cost) as sum_cost
FROM TABLE1
GROUP BY pers_key, visit
) t
GROUP BY pers_key
) B
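That this version is faster is a bit counterintuitive (to me, at least). What happens is that GROUP BY is something Hive parallelizes readily, whereas COUNT(DISTINCT) is processed serially. This sometimes happens in other databases as well (I have seen similar behavior with COUNT(DISTINCT) in Postgres). Another caveat: I did not set up the Hive system where I found this, so it might be some kind of configuration issue.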
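For reference, here is a sketch of how the two-level GROUP BY fragment might slot into the full query in place of the COUNT(DISTINCT) subquery. It uses only the names from the question. Whether it is actually faster depends on the engine and its configuration, so treat it as something to benchmark rather than a guaranteed improvement.

```sql
SELECT A.pers_key,
       B.sum_cost / A.months AS ind1,
       B.visit_count / A.months AS ind2
FROM TABLE2 A JOIN
     (SELECT pers_key,
             SUM(sum_cost) AS sum_cost,    -- re-sum the per-visit totals
             COUNT(visit) AS visit_count   -- each row of t is one distinct visit
      FROM (SELECT pers_key, visit, SUM(cost) AS sum_cost
            FROM TABLE1
            GROUP BY pers_key, visit       -- collapses duplicate visits
           ) t
      GROUP BY pers_key
     ) B
     ON A.pers_key = B.pers_key;
```

The inner GROUP BY leaves one row per (pers_key, visit) pair, so a plain COUNT(visit) in the outer level returns the same number as COUNT(DISTINCT visit). Rows with a NULL visit are skipped by both forms, while their cost still contributes to sum_cost.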