I have the following query:
SELECT DISTINCT
e.id,
folder,
subject,
in_reply_to,
message_id,
"references",
e.updated_at,
(
select count(*)
from emails
where
(
select "references"[1]
from emails
where message_id = e.message_id
) = ANY ("references")
or message_id =
(
select "references"[1]
from emails
where message_id = e.message_id
)
)
FROM "emails" e
INNER JOIN "email_participants"
ON ("email_participants"."email_id" = e."id")
WHERE (("user_id" = 220)
AND ("folder" = 'INBOX'))
ORDER BY e."updated_at" DESC
LIMIT 10 OFFSET 0;
Here is the explain analyze output for the query above.
The query was fine until I added the count subquery below:
(
select count(*)
from emails
where
(
select "references"[1]
from emails
where message_id = e.message_id
) = ANY ("references")
or message_id =
(
select "references"[1]
from emails
where message_id = e.message_id
)
)
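In fact, I have tried simpler subqueries, and it seems to be the aggregate function itself that takes the time.
So is there another way to attach the count subquery to each result? Should I update the results after the initial query has run?
Here is a pastebin that creates the tables and, at the end, runs the poorly performing query to show what the output should be.
Answer 0: (score: 3)
As far as I understand the semantics of your query, you can simplify: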
select count(*)
from emails
where
(
select "references"[1]
from emails
where message_id = e.message_id
) = ANY ("references")
or message_id =
(
select "references"[1]
from emails
where message_id = e.message_id
)
to:
select count(*)
from emails
where
e."references"[1] = ANY ("references") OR message_id = e."references"[1]
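Strictly speaking, message_id is not guaranteed to be unique, but if you did have several rows for a given message_id value, your original query would fail, because the scalar subquery would return more than one row.
However, this simplification does not significantly change the cost of the query. The real problem is that the query needs two full scans of the emails table (plus an index scan on emails_message_id_index). You can save one of the full scans with an index on the references array.
You can create such an index with: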
CREATE INDEX emails_references_index ON emails USING GIN ("references");
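The index alone already helps the original query: with up-to-date statistics and a sufficiently large number of rows, PostgreSQL will use an index scan. However, you should rewrite the subquery as follows to help the planner perform a bitmap index scan on this array index: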
select count(*)
from emails
where
ARRAY[e."references"[1]] <@ "references"
OR message_id = e."references"[1]
The final query would be:
SELECT DISTINCT
e.id,
folder,
subject,
in_reply_to,
message_id,
"references",
e.updated_at,
(
select count(*)
from emails
where
ARRAY[e."references"[1]] <@ "references"
OR message_id = e."references"[1]
)
FROM "emails" e
INNER JOIN "email_participants"
ON ("email_participants"."email_id" = e."id")
WHERE (("user_id" = 220)
AND ("folder" = 'INBOX'))
ORDER BY e."updated_at" DESC
LIMIT 10 OFFSET 0;
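One way to check whether the planner actually uses the GIN index is to compare the plans for the two forms of the array predicate. This is only a sketch: it assumes the emails table from the pastebin, the emails_references_index created above, and that "references" is a text[] column. The message ID literal is a made-up placeholder.

```sql
-- Refresh planner statistics so the index is considered realistically.
ANALYZE emails;

-- Containment form: <@ is supported by the default GIN operator class
-- for arrays, so this predicate is eligible for a bitmap index scan.
EXPLAIN
SELECT count(*)
FROM emails
WHERE ARRAY['<root@example.com>'] <@ "references";  -- placeholder ID

-- "= ANY (column)" form: same result for a single value, but it is not
-- an indexable GIN operator, so expect a sequential scan here.
EXPLAIN
SELECT count(*)
FROM emails
WHERE '<root@example.com>' = ANY ("references");    -- placeholder ID
```

On a very small table the planner may still prefer a sequential scan for both forms, so the difference is easiest to see with a realistic number of rows.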
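To illustrate the expected gain, some tests were run in a test environment.
Answer 1: (score: 3)
Expanding on Paul Guyot's answer, you can move the subquery into a derived table, which should perform faster because it obtains the message counts in a single scan (plus a join) rather than one scan per row.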
SELECT DISTINCT
e.id,
e.folder,
e.subject,
in_reply_to,
e.message_id,
e."references",
e.updated_at,
t1.message_count
FROM "emails" e
INNER JOIN "email_participants"
ON ("email_participants"."email_id" = e."id")
INNER JOIN (
SELECT COUNT(e2.id) message_count, e.message_id
FROM emails e
LEFT JOIN emails e2 ON (ARRAY[e."references"[1]] <@ e2."references"
OR e2.message_id = e."references"[1])
GROUP BY e.message_id
) t1 ON t1.message_id = e.message_id
WHERE (("user_id" = 220)
AND ("folder" = 'INBOX'))
ORDER BY e."updated_at" DESC
LIMIT 10 OFFSET 0;
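SQL Fiddle using the pastebin data - http://www.sqlfiddle.com/#!15/c6298/7
Below are the query plans postgres generated for getting the count in a correlated subquery versus getting the count by joining to a derived table. I used one of my own tables, but I think the results should be similar.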
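The two query shapes being compared can be sketched roughly as follows. This is a reconstruction, not the exact queries that were run: the table name visit and the column company_id come from the plans, while the temporary stand-in table, its id column, and its data are assumptions made only so the sketch is self-contained.

```sql
-- Hypothetical stand-in for the "visit" table (about 40,000 rows, 3 companies).
CREATE TEMP TABLE visit AS
SELECT g AS id, g % 3 AS company_id
FROM generate_series(1, 40000) AS g;
ANALYZE visit;

-- Correlated subquery: the inner count is re-executed for every outer row
-- (this is what loops=1000 on the SubPlan indicates).
EXPLAIN ANALYZE
SELECT v.company_id,
       (SELECT count(*)
        FROM visit v2
        WHERE v2.company_id = v.company_id) AS visit_count
FROM visit v
LIMIT 1000;

-- Derived table: counts are aggregated once per company, then joined.
EXPLAIN ANALYZE
SELECT v.company_id, t.visit_count
FROM visit v
JOIN (SELECT company_id, count(*) AS visit_count
      FROM visit
      GROUP BY company_id) t
  ON t.company_id = v.company_id
LIMIT 1000;
```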
Correlated subquery
"Limit (cost=0.00..1123641.81 rows=1000 width=8) (actual time=11.237..5395.237 rows=1000 loops=1)"
" -> Seq Scan on visit v (cost=0.00..44996236.24 rows=40045 width=8) (actual time=11.236..5395.014 rows=1000 loops=1)"
" SubPlan 1"
" -> Aggregate (cost=1123.61..1123.62 rows=1 width=0) (actual time=5.393..5.393 rows=1 loops=1000)"
" -> Seq Scan on visit v2 (cost=0.00..1073.56 rows=20018 width=0) (actual time=0.002..4.280 rows=21393 loops=1000)"
" Filter: (company_id = v.company_id)"
" Rows Removed by Filter: 18653"
"Total runtime: 5395.369 ms"
Join to derived table
"Limit (cost=1173.74..1211.81 rows=1000 width=12) (actual time=21.819..22.629 rows=1000 loops=1)"
" -> Hash Join (cost=1173.74..2697.72 rows=40036 width=12) (actual time=21.817..22.465 rows=1000 loops=1)"
" Hash Cond: (v.company_id = visit.company_id)"
" -> Seq Scan on visit v (cost=0.00..973.45 rows=40045 width=8) (actual time=0.010..0.198 rows=1000 loops=1)"
" -> Hash (cost=1173.71..1173.71 rows=2 width=12) (actual time=21.787..21.787 rows=2 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 1kB"
" -> HashAggregate (cost=1173.67..1173.69 rows=2 width=4) (actual time=21.783..21.784 rows=3 loops=1)"
" -> Seq Scan on visit (cost=0.00..973.45 rows=40045 width=4) (actual time=0.003..6.695 rows=40046 loops=1)"
"Total runtime: 22.806 ms"
Answer 2: (score: 2)
This is hard to get right without test data:
select
e.id,
folder,
subject,
in_reply_to,
message_id,
"references",
e.updated_at,
sum(the_count) as the_count
from
(
select *, (
"references"[1] = any ("references")
or
message_id = "references"[1]
)::integer as the_count
from emails
) e
inner join
email_participants on email_participants.email_id = e.id
where
user_id = 220
and
folder = 'INBOX'
group by 1, 2, 3, 4, 5, 6, 7
order by e.updated_at desc
limit 10 offset 0;
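Your query is slow because it performs a table or index lookup for every row of the result set. This is called a correlated subquery.
group by 1, 2,...
is just shorthand that refers to the columns of the select list by position instead of by name.
The cast from boolean to integer yields 1 or 0.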
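Both points can be seen in isolation in a small sketch that runs on inline sample rows (the values are made up for illustration and do not come from the pastebin):

```sql
-- A boolean cast to integer yields 1 for true and 0 for false.
SELECT true::integer AS t, false::integer AS f;   -- t = 1, f = 0

-- Summing the cast counts the rows where the condition holds.
-- "GROUP BY 1" groups by the first select-list column (folder).
SELECT folder,
       sum((subject IS NOT NULL)::integer) AS with_subject
FROM (VALUES ('INBOX', 'hello'),
             ('INBOX', NULL),
             ('INBOX', 're: hello'),
             ('Sent',  'fwd')) AS t(folder, subject)
GROUP BY 1
ORDER BY 1;
-- INBOX | 2
-- Sent  | 1
```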
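Answer 3: (score: 0)
I used the query in your pastebin as the starting point. It differs from the one posted here in that it does not join the email_participants table.
I believe it can be as simple as this (or am I missing something?):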
SELECT e.id, e.folder, e.subject, e.message_id, e.references, e.updated_at, COUNT(e1.message_id)
FROM emails e
LEFT OUTER JOIN emails e1
ON e1.message_id = e.message_id
AND (e1.references[1] = ANY (e.references) OR e1.references[1] = e.message_id)
GROUP BY e.id, e.folder, e.subject, e.message_id, e.references, e.updated_at;