Why is an equivalent, more complicated query 10 times faster?

Date: 2016-10-20 08:36:11

Tags: postgresql group-by subquery postgresql-9.5

I have a table that I want to sanitize. The table essentially represents a tree structure:

Channels -> (n0) Partners -> (n1) CampaignGroups -> (n2) Campaigns -> ... (ni) further levels

CREATE TABLE campaign_tree (
    campaign_id int,
    channel_id int,
    channel_name text,
    partner_name text,
    campaign_group_name text,
    campaign_name text,
    ad_name text
);

To sanitize the data, make the names case-insensitive, and get rid of redundant IDs, I first have to find the data that needs updating. I came up with two ways to tackle this:

Approach 1
First fetch the structure of the tree at the upper levels, then get rid of different IDs that share the same names:

SELECT
    count(1),
    min(campaign_id) AS new_campaign_id,
    campaign_name,
    channel_name,
    partner_name,
    campaign_group_name
FROM
(SELECT DISTINCT
    campaign_id,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name,
    upper(campaign_name) AS campaign_name
FROM
    campaign_tree
) tmp
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING count(1)>1 --only need to get those that we need to sanitize

This query takes around 350 ms to execute. Here is the query plan:

HashAggregate  (cost=18008.63..18081.98 rows=5868 width=136) (actual time=391.868..404.130 rows=33 loops=1)
  Output: count(1), min(campaign_tree.campaign_id), (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(1) > 1)
  Rows Removed by Filter: 64855
  ->  Unique  (cost=15324.20..16394.93 rows=58680 width=83) (actual time=282.253..338.041 rows=64998 loops=1)
        Output: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=282.251..305.340 rows=71382 loops=1)
              Output: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Key: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Method: external merge  Disk: 6608kB
              ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.015..146.611 rows=71382 loops=1)
                    Output: campaign_tree.campaign_id, upper(campaign_tree.channel_name), upper(campaign_tree.partner_name), upper(campaign_tree.campaign_group_name), upper(campaign_tree.campaign_name)
Planning time: 0.085 ms
Execution time: 407.383 ms

Approach 2
The direct approach: count the distinct IDs of entries that share the same names, and at the same time determine the minimum of those distinct IDs.

SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING count(distinct campaign_id)>1

The results are identical, just in a different order. This one takes about 4 seconds per execution. Here is the query plan:

GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=3723.908..4004.447 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=3718.016..3934.400 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..150.634 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.066 ms
Execution time: 4006.323 ms

Approach 3
After some discussion, I decided to try a variation of the second approach, referencing the expressions by ordinal position instead of writing them out explicitly in the GROUP BY clause:

SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY 3, 4, 5, 6
HAVING count(distinct campaign_id)>1

The query plan:

GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=1148.957..1316.564 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=1148.849..1240.184 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..148.835 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.067 ms
Execution time: 1318.397 ms

And no, there are no indexes on this table. I know they would improve things; that is not the point of this question.

The question is: why is there such a big difference in execution times? The query plans do not enlighten me at all.

4 Answers:

Answer 0 (score: 0)

Reading the plans, they diverge at the point where one does a Unique while the other groups by distinct campaign_id.

This tells me that the problem is that grouping with count(1) > 1 (which is equivalent to what you are doing) is much cheaper than grouping with count(distinct campaign_id).

That makes sense, because in the former the rows have already been de-duplicated by the inner DISTINCT, whereas in the latter a secondary distinct computation has to run on top of the grouping.

Answer 1 (score: 0)

Just a thought, but you might try:

having max(campaign_id) > min(campaign_id)

It should be easier for the executor to keep track of the minimum and maximum values than to keep track of the number of distinct IDs.
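Applied to Approach 2, the full query would look something like the sketch below (untested). Note that it drops the cnt output column entirely, since computing it would reintroduce count(distinct ...); a group has more than one distinct campaign_id exactly when its max differs from its min:

```sql
SELECT
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING max(campaign_id) > min(campaign_id);
```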

Answer 2 (score: 0)

[Elaborating on my comment about pre-aggregation] IMHO the cost lies in the sorting that the count(distinct ...) aggregate requires, and which cannot be done any other way (not in the current version, anyway).

The real solution would of course be to constrain the four domains (maybe even enumerate them, and/or squeeze them out into separate tables).

Untested, since I don't have the table definition:

SELECT
    channel_name
    , partner_name
    , campaign_group_name
    , campaign_name
    , min(campaign_id) AS new_campaign_id
    , sum(the_count) AS the_count
FROM (SELECT DISTINCT
        upper(channel_name) AS channel_name
        , upper(partner_name) AS partner_name
        , upper(campaign_group_name) AS campaign_group_name
        , upper(campaign_name) AS campaign_name
        , MIN(campaign_id) AS campaign_id
        , sum(the_count) AS the_count
        FROM (SELECT DISTINCT
            channel_name AS channel_name
            , partner_name AS partner_name
            , campaign_group_name AS campaign_group_name
            , campaign_name AS campaign_name
            , MIN(campaign_id) AS campaign_id
            , COUNT(1) AS the_count
            FROM campaign_tree
            GROUP BY 1,2,3,4
            ) pre
    GROUP BY 1,2,3,4
    ) agg
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING sum(the_count) > 1 --only need to get those that we need to sanitize
        ;
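The same pre-aggregation idea can be written with one level of nesting removed (equally untested): group on the raw, case-sensitive names first, so that the upper() grouping runs over far fewer rows:

```sql
SELECT
    upper(channel_name)        AS channel_name,
    upper(partner_name)        AS partner_name,
    upper(campaign_group_name) AS campaign_group_name,
    upper(campaign_name)       AS campaign_name,
    min(campaign_id)           AS new_campaign_id,
    sum(the_count)             AS the_count
FROM (
    -- pre-aggregate on the raw names first
    SELECT
        channel_name,
        partner_name,
        campaign_group_name,
        campaign_name,
        min(campaign_id) AS campaign_id,
        count(1)         AS the_count
    FROM campaign_tree
    GROUP BY 1, 2, 3, 4
) pre
GROUP BY 1, 2, 3, 4
HAVING sum(the_count) > 1;
```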

Answer 3 (score: 0)

This is not really an answer, just a helper for anyone who wants to test with some simulated data. I hope it helps in understanding what is going on.

Here is Python code that creates a table named campaign_tree in a database and fills it with n=71382 rows of simulated data (I took that number from the plans):

import random

n = 71382
table_name = "campaign_tree"

# Words to build the random names from; single quotes are doubled for SQL escaping.
words = ["You", "may", "say", "I''m", "a", "dreamer", "But", "I''m", "not",
         "the", "only", "one", "I", "hope", "someday", "you''ll", "join",
         "us", "And", "the", "world", "will", "be", "as", "one"]

transaction = """
BEGIN;
CREATE TABLE """ + table_name + """
(
    campaign_id integer,
    campaign_name text,
    channel_name text,
    partner_name text,
    campaign_group_name text
);

INSERT INTO """ + table_name + """ (campaign_id, campaign_name, channel_name, partner_name, campaign_group_name)
VALUES """

values = []
for i in range(1, n + 1):
    # One row: a sequential id plus four randomly chosen names.
    row = [str(i)] + ["'" + random.choice(words) + "'" for _ in range(4)]
    values.append("(" + ", ".join(row) + ")")

transaction = transaction + ",\n".join(values) + ";\nCOMMIT;"

with open("test.sql", "w") as foutput:
    foutput.write(transaction)

Save it as test.py, then run python test.py. It will generate a file named test.sql. Finally, run psql -f test.sql and you are done. Happy testing :)