The following query takes a long time to execute. It runs on the Tez execution engine.
SELECT STG.EMP_TYPE, DEPT, A.TOTAL_COUNT, COUNT(DISTINCT EMP_ID) AS COUNT_DEPT
FROM
STAGE_SOURCE STG
LEFT OUTER JOIN
( SELECT EMP_TYPE, COUNT(DISTINCT EMP_ID) AS TOTAL_COUNT
FROM STAGE_SOURCE
GROUP BY EMP_TYPE
) A
ON STG.EMP_TYPE = A.EMP_TYPE
GROUP BY STG.EMP_TYPE, DEPT, A.TOTAL_COUNT;
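Are there any rewrites or optimization strategies that could improve its performance?
Answer 0: (score: 0)
I suggest reading the table only once.
Your inner aggregate can be computed with a window function.
I believe this query gives you the same result, and it gets rid of the JOIN.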
SELECT
EMP_TYPE,
DEPT,
COUNT( DISTINCT EMP_ID ) OVER ( PARTITION BY EMP_TYPE ) AS TOTAL_COUNT,
COUNT( DISTINCT EMP_ID ) AS COUNT_DEPT
FROM
STAGE_SOURCE
GROUP BY EMP_TYPE, DEPT
Keep in mind that GROUP BY can also take advantage of indexes.
Here is a link to the Apache Hive manual on Windowing and Analytics Functions.
#Edit after comment
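At least in PostgreSQL, the DISTINCT clause is applied after the window function aggregates are computed, which allows a small hack that might give you what you need. This way we get rid of the GROUP BY. See how it works on Postgres: SQLFiddle
Try the following query: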
SELECT
DISTINCT
EMP_TYPE,
DEPT,
COUNT( DISTINCT EMP_ID ) OVER ( PARTITION BY EMP_TYPE ) AS TOTAL_COUNT,
COUNT( DISTINCT EMP_ID ) OVER ( PARTITION BY EMP_TYPE, DEPT ) AS COUNT_DEPT
FROM
STAGE_SOURCE
#Edit 2
SELECT
DISTINCT
EMP_TYPE,
DEPT,
COUNT( DISTINCT EMP_ID ) OVER ( PARTITION BY EMP_TYPE ) AS TOTAL_COUNT,
COUNT( DISTINCT EMP_ID ) OVER ( PARTITION BY EMP_TYPE, DEPT ) AS COUNT_DEPT
FROM (
SELECT DISTINCT EMP_TYPE, DEPT, EMP_ID FROM STAGE_SOURCE
) foo
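If the engine or version in use rejects DISTINCT inside a windowed aggregate, the same two counts can probably be produced without it. The following is only a sketch and has not been run on Hive. It assumes EMP_TYPE is never NULL, and the aliases d, r, t and the helper column rn are made up for illustration. The idea is to deduplicate the rows once, flag the first row of each (EMP_TYPE, EMP_ID) pair, and sum those flags per EMP_TYPE.

```sql
SELECT
  EMP_TYPE,
  DEPT,
  MAX(TOTAL_COUNT) AS TOTAL_COUNT,   -- same value on every row of an EMP_TYPE
  COUNT(EMP_ID)    AS COUNT_DEPT     -- rows are already distinct, NULL ids are skipped
FROM (
  SELECT
    EMP_TYPE,
    DEPT,
    EMP_ID,
    -- count each non-NULL EMP_ID once per EMP_TYPE by summing the rn = 1 flags
    SUM(CASE WHEN rn = 1 AND EMP_ID IS NOT NULL THEN 1 ELSE 0 END)
      OVER (PARTITION BY EMP_TYPE) AS TOTAL_COUNT
  FROM (
    SELECT
      EMP_TYPE,
      DEPT,
      EMP_ID,
      -- rn = 1 marks the first department seen for each (EMP_TYPE, EMP_ID)
      ROW_NUMBER() OVER (PARTITION BY EMP_TYPE, EMP_ID ORDER BY DEPT) AS rn
    FROM (
      SELECT DISTINCT EMP_TYPE, DEPT, EMP_ID FROM STAGE_SOURCE
    ) d
  ) r
) t
GROUP BY EMP_TYPE, DEPT;
```

Whether this is faster than the JOIN version depends on the data, so it is worth comparing the plans with EXPLAIN before switching.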
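Answer 1: (score: 0)
From reading your query, I understand that you need to compute two values: first, the count of EMP_ID under each EMP_TYPE, and second, the count of EMP_ID under each DEPT within an EMP_TYPE.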
SELECT
STG.EMP_TYPE,
DEPT,
TOTAL_COUNT,
COUNT(EMP_ID) AS COUNT_DEPT
FROM
STAGE_SOURCE STG
JOIN
( SELECT EMP_TYPE, COUNT(EMP_ID) AS TOTAL_COUNT
FROM STAGE_SOURCE
GROUP BY EMP_TYPE
) A
ON STG.EMP_TYPE = A.EMP_TYPE
GROUP BY STG.EMP_TYPE, DEPT,TOTAL_COUNT;
Using GROUP BY instead of DISTINCT wherever possible can reduce the run time. As "Consider Me" mentioned above, GROUP BY takes advantage of indexes.
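Note that plain COUNT(EMP_ID) only matches the original COUNT(DISTINCT EMP_ID) when an EMP_ID never repeats within its group. If duplicates are possible, one way to keep the distinct semantics while still relying on GROUP BY is to deduplicate in a subquery first and then count. This is a sketch under that assumption, not a tested Hive query, and the aliases d, e and a are illustrative.

```sql
SELECT
  d.EMP_TYPE,
  d.DEPT,
  a.TOTAL_COUNT,
  COUNT(d.EMP_ID) AS COUNT_DEPT        -- d holds one row per (EMP_TYPE, DEPT, EMP_ID)
FROM (
  -- deduplicate with GROUP BY instead of DISTINCT
  SELECT EMP_TYPE, DEPT, EMP_ID
  FROM STAGE_SOURCE
  GROUP BY EMP_TYPE, DEPT, EMP_ID
) d
LEFT OUTER JOIN (
  SELECT EMP_TYPE, COUNT(EMP_ID) AS TOTAL_COUNT
  FROM (
    -- one row per (EMP_TYPE, EMP_ID), so a plain COUNT is a distinct count
    SELECT EMP_TYPE, EMP_ID
    FROM STAGE_SOURCE
    GROUP BY EMP_TYPE, EMP_ID
  ) e
  GROUP BY EMP_TYPE
) a
ON d.EMP_TYPE = a.EMP_TYPE
GROUP BY d.EMP_TYPE, d.DEPT, a.TOTAL_COUNT;
```

The table is still scanned twice here, so any gain would come from spreading the deduplication across the GROUP BY keys rather than from removing the JOIN.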