I have a table that I want to sanitize. The table essentially represents a tree structure:
Channels -> (n0) Partners -> (n1) CampaignGroups -> (n2) Campaigns -> ... (ni) further levels
CREATE TABLE campaign_tree (
    campaign_id int,
    channel_id int,
    channel_name text,
    partner_name text,
    campaign_group_name text,
    campaign_name text,
    ad_name text
);
To sanitize the data, making the names case-insensitive and eliminating redundant IDs, I first need to find the rows that have to be updated. I have two approaches to this problem:
Approach 1
First get the structure of the tree at the upper levels, then collapse the different IDs that share the same names:
SELECT
    count(1),
    min(campaign_id) AS new_campaign_id,
    campaign_name,
    channel_name,
    partner_name,
    campaign_group_name
FROM
    (SELECT DISTINCT
        campaign_id,
        upper(channel_name) AS channel_name,
        upper(partner_name) AS partner_name,
        upper(campaign_group_name) AS campaign_group_name,
        upper(campaign_name) AS campaign_name
    FROM
        campaign_tree
    ) tmp
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING count(1) > 1 -- only need to get those that we need to sanitize
This query takes about 350 ms to execute. The query plan:
HashAggregate  (cost=18008.63..18081.98 rows=5868 width=136) (actual time=391.868..404.130 rows=33 loops=1)
  Output: count(1), min(campaign_tree.campaign_id), (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(1) > 1)
  Rows Removed by Filter: 64855
  ->  Unique  (cost=15324.20..16394.93 rows=58680 width=83) (actual time=282.253..338.041 rows=64998 loops=1)
        Output: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=282.251..305.340 rows=71382 loops=1)
              Output: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Key: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Method: external merge  Disk: 6608kB
              ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.015..146.611 rows=71382 loops=1)
                    Output: campaign_tree.campaign_id, upper(campaign_tree.channel_name), upper(campaign_tree.partner_name), upper(campaign_tree.campaign_group_name), upper(campaign_tree.campaign_name)
Planning time: 0.085 ms
Execution time: 407.383 ms
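For anyone who wants to poke at the shape of approach 1 without the original data, here is a minimal sketch using Python's sqlite3 module. The table contents, names, and IDs below are invented purely for illustration:

```python
import sqlite3

# In-memory toy data: the same names in different casings, with different ids.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE campaign_tree (
    campaign_id int, channel_name text, partner_name text,
    campaign_group_name text, campaign_name text)""")
conn.executemany(
    "INSERT INTO campaign_tree VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Web", "Acme", "G1", "Summer"),
        (2, "WEB", "ACME", "g1", "SUMMER"),  # same group, different casing and id
        (3, "Web", "Acme", "G1", "Summer"),  # same names again under a third id
        (4, "TV",  "Acme", "G2", "Winter"),  # a group that needs no sanitizing
    ],
)

# Approach 1: deduplicate (id, uppercased names) first, then count per name group.
rows = conn.execute("""
    SELECT count(1), min(campaign_id) AS new_campaign_id,
           channel_name, partner_name, campaign_group_name, campaign_name
    FROM (SELECT DISTINCT campaign_id,
                 upper(channel_name) AS channel_name,
                 upper(partner_name) AS partner_name,
                 upper(campaign_group_name) AS campaign_group_name,
                 upper(campaign_name) AS campaign_name
          FROM campaign_tree) tmp
    GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
    HAVING count(1) > 1
""").fetchall()
print(rows)  # [(3, 1, 'WEB', 'ACME', 'G1', 'SUMMER')]: ids 1, 2, 3 collapse onto id 1
```

Only the duplicated group survives the HAVING filter, with the minimum id chosen as the surviving `new_campaign_id`.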
Approach 2
The direct approach: count the distinct IDs of the items that share the same names, and at the same time determine the minimum of those distinct IDs:
SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING count(distinct campaign_id) > 1
The results are identical, just in a different order. Each execution takes about 4 seconds. The query plan:
GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=3723.908..4004.447 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=3718.016..3934.400 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..150.634 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.066 ms
Execution time: 4006.323 ms
Approach 3
After some discussion, I decided to try a variation of the second approach, referring to the expressions by their ordinal positions instead of writing them out explicitly in the GROUP BY clause:
SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY 3, 4, 5, 6
HAVING count(distinct campaign_id) > 1
The query plan:
GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=1148.957..1316.564 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=1148.849..1240.184 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..148.835 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.067 ms
Execution time: 1318.397 ms
No, there are no indexes on this table. I know they would improve things; that is not the point of this question.
The question is: why is there such a big difference in execution time? The query plans tell me nothing.
Answer 0 (score: 0)
Reading the plans, they diverge at the point where one does a Unique while the other groups by distinct campaign_id.
This tells me that the problem is that grouping with count(1) > 1 (which matches what you are doing) is much cheaper than grouping with count(distinct campaign_id).
That makes sense, because in the former you have already deduplicated in the subquery, whereas in the second case you need an additional distinct computation on top of the grouping.
Answer 1 (score: 0)
Just a thought, but you might try:
having max(campaign_id) > min(campaign_id)
It should be easier for the executor to keep track of running minimum and maximum values than to keep track of the number of distinct IDs.
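A quick sketch of why this is a drop-in replacement, again with sqlite3 and made-up data: for non-null IDs, a group has more than one distinct id exactly when its maximum id differs from its minimum, so both HAVING clauses select the same groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (campaign_id int, name text)")
# "a"/"A" share a group with two distinct ids; "b"/"B" share one id; "c" is unique.
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "a"), (2, "A"), (3, "b"), (3, "B"), (4, "c")])

# Groups with more than one distinct id...
distinct_filter = conn.execute("""
    SELECT upper(name), min(campaign_id) FROM t
    GROUP BY upper(name) HAVING count(distinct campaign_id) > 1
    ORDER BY 1""").fetchall()

# ...are exactly the groups where max(id) differs from min(id); the latter
# needs only two running scalars per group instead of a set of seen ids.
minmax_filter = conn.execute("""
    SELECT upper(name), min(campaign_id) FROM t
    GROUP BY upper(name) HAVING max(campaign_id) > min(campaign_id)
    ORDER BY 1""").fetchall()

print(distinct_filter == minmax_filter)  # True
```

Both filters return only the "A" group here, keeping its minimum id 1.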
Answer 2 (score: 0)
[Elaborating on my comment about pre-aggregation] IMHO the cost is in the sort, which is what the grouped aggregate needs and which cannot be done any other way (not in the current version, anyway).
The real solution, of course, would be to constrain the four domains (maybe even enumerate them, and/or squeeze them out into separate tables).
Untested, since I don't have the table definitions:
SELECT
      channel_name
    , partner_name
    , campaign_group_name
    , campaign_name
    , min(campaign_id) AS new_campaign_id
    , sum(the_count) AS the_count
FROM (SELECT DISTINCT
          upper(channel_name) AS channel_name
        , upper(partner_name) AS partner_name
        , upper(campaign_group_name) AS campaign_group_name
        , upper(campaign_name) AS campaign_name
        , MIN(campaign_id) AS campaign_id
        , sum(the_count) AS the_count
    FROM (SELECT DISTINCT
              channel_name AS channel_name
            , partner_name AS partner_name
            , campaign_group_name AS campaign_group_name
            , campaign_name AS campaign_name
            , MIN(campaign_id) AS campaign_id
            , COUNT(1) AS the_count
        FROM campaign_tree
        GROUP BY 1,2,3,4
        ) pre
    GROUP BY 1,2,3,4
    ) agg
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING sum(the_count) > 1 -- only need to get those that we need to sanitize
;
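The pre-aggregation idea can be sketched on toy data (sqlite3; the table contents below are invented): group on the raw names first, which is cheap and shrinks the row set, then roll the smaller result up on the uppercased names, carrying counts and minimum ids through both stages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campaign_tree (campaign_id int, name text)")
# "a" appears twice and "A" once: three rows collapse into one sanitized group.
conn.executemany("INSERT INTO campaign_tree VALUES (?, ?)",
                 [(1, "a"), (2, "A"), (3, "a"), (4, "b")])

rows = conn.execute("""
    SELECT upper(name) AS name,
           min(campaign_id) AS new_campaign_id,
           sum(the_count) AS the_count
    FROM (SELECT name,
                 min(campaign_id) AS campaign_id,
                 count(1) AS the_count
          FROM campaign_tree
          GROUP BY name) pre       -- stage 1: group on the raw names
    GROUP BY upper(name)           -- stage 2: far fewer rows reach upper()
    HAVING sum(the_count) > 1
""").fetchall()
print(rows)  # [('A', 1, 3)]: min id and total row count survive the roll-up
```

Stage 1 reduces {a, a, A, b} to three rows, and stage 2 merges "a" and "A" into one group with `min` of the per-group minima and `sum` of the per-group counts.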
Answer 3 (score: 0)
This is not really an answer, just something to help anyone who wants to test against some simulated data. I hope it helps with understanding what is going on.
Here is Python code that creates a table named campaign_tree in your database and fills it with n=71382 rows of simulated data (I took that number from the plans):
import random

n = 71382
table_name = "campaign_tree"
words = ["You", "may", "say", "I''m", "a", "dreamer", "But", "I''m", "not",
         "the", "only", "one", "I", "hope", "someday", "you''ll", "join",
         "us", "And", "the", "world", "will", "be", "as", "one"]

transaction = """
BEGIN;
CREATE TABLE """ + table_name + """
(
    campaign_id integer,
    campaign_name text,
    channel_name text,
    partner_name text,
    campaign_group_name text
);
INSERT INTO """ + table_name + """ (campaign_id, campaign_name, channel_name, partner_name, campaign_group_name)
VALUES """

# random.choice draws from the whole word list (randint(1, lset) would
# accidentally skip the first word)
values = []
for i in range(1, n + 1):
    values.append("(" + str(i) + ", '" +
                  random.choice(words) + "', '" +
                  random.choice(words) + "', '" +
                  random.choice(words) + "', '" +
                  random.choice(words) + "')")

transaction = transaction + ",\n".join(values) + "; COMMIT;"

with open("test.sql", "w") as foutput:
    foutput.write(transaction)
Save it as test.py, then run python test.py. It will generate a file named test.sql. Finally, run psql -f test.sql and you are done. Happy testing :)