我正在使用postgres 9.2.4。
我们有一个后台工作,可以将用户的电子邮件导入我们的系统,并将它们存储在postgres数据库表中。
下表:
CREATE TABLE emails
(
id serial NOT NULL,
subject text,
body text,
personal boolean,
sent_at timestamp without time zone NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
account_id integer NOT NULL,
sender_user_id integer,
sender_contact_id integer,
html text,
folder text,
draft boolean DEFAULT false,
check_for_response timestamp without time zone,
send_time timestamp without time zone,
CONSTRAINT emails_pkey PRIMARY KEY (id),
CONSTRAINT emails_account_id_fkey FOREIGN KEY (account_id)
REFERENCES accounts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT emails_sender_contact_id_fkey FOREIGN KEY (sender_contact_id)
REFERENCES contacts (id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
OIDS=FALSE
);
ALTER TABLE emails
OWNER TO paulcowan;
-- Index: emails_account_id_index
-- DROP INDEX emails_account_id_index;
CREATE INDEX emails_account_id_index
ON emails
USING btree
(account_id);
-- Index: emails_sender_contact_id_index
-- DROP INDEX emails_sender_contact_id_index;
CREATE INDEX emails_sender_contact_id_index
ON emails
USING btree
(sender_contact_id);
-- Index: emails_sender_user_id_index
-- DROP INDEX emails_sender_user_id_index;
CREATE INDEX emails_sender_user_id_index
ON emails
USING btree
(sender_user_id);
查询更复杂,因为我在这个表上有一个视图,我在其中输入其他数据:
CREATE OR REPLACE VIEW email_graphs AS
SELECT emails.id, emails.subject, emails.body, emails.folder, emails.html,
emails.personal, emails.draft, emails.created_at, emails.updated_at,
emails.sent_at, emails.sender_contact_id, emails.sender_user_id,
emails.addresses, emails.read_by, emails.check_for_response,
emails.send_time, ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
FROM emails
LEFT JOIN ( SELECT todos.reference_email_id AS email_id,
array_to_json(array_agg(todos.id)) AS ids
FROM todos
GROUP BY todos.reference_email_id) ts ON ts.email_id = emails.id
LEFT JOIN ( SELECT calls.reference_email_id AS email_id,
array_to_json(array_agg(calls.id)) AS ids
FROM calls
GROUP BY calls.reference_email_id) cs ON cs.email_id = emails.id
LEFT JOIN ( SELECT deals.reference_email_id AS email_id,
array_to_json(array_agg(deals.id)) AS ids
FROM deals
GROUP BY deals.reference_email_id) ds ON ds.email_id = emails.id
LEFT JOIN ( SELECT meetings.reference_email_id AS email_id,
array_to_json(array_agg(meetings.id)) AS ids
FROM meetings
GROUP BY meetings.reference_email_id) ms ON ms.email_id = emails.id
LEFT JOIN ( SELECT comments.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (comments.id,comments.text,comments.author_id,comments.created_at,comments.updated_at)) r(id, text, author_id, created_at, updated_at)))) AS comments
FROM comments
WHERE comments.email_id IS NOT NULL
GROUP BY comments.email_id) c ON c.email_id = emails.id
LEFT JOIN ( SELECT email_participants.email_id,
array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
FROM ( VALUES (email_participants.user_id,email_participants.contact_id,email_participants.kind)) r(user_id, contact_id, kind)))) AS people
FROM email_participants
GROUP BY email_participants.email_id) p ON p.email_id = emails.id
LEFT JOIN ( SELECT attachments.reference_email_id AS email_id,
array_to_json(array_agg(attachments.id)) AS ids
FROM attachments
GROUP BY attachments.reference_email_id) atts ON atts.email_id = emails.id;
ALTER TABLE email_graphs
OWNER TO paulcowan;
然后我们针对此视图运行分页查询,例如
SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0
随着表格的增长,此表格上的查询速度已大幅放缓。
如果我使用EXPLAIN ANALYZE运行分页查询
EXPLAIN ANALYZE SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0;
我得到了这个结果
-> Seq Scan on deals (cost=0.00..9.11 rows=36 width=8) (actual time=0.003..0.044 rows=34 loops=1)
-> Sort (cost=5.36..5.43 rows=131 width=36) (actual time=0.416..0.416 rows=1 loops=1)
Sort Key: ms.email_id
Sort Method: quicksort Memory: 26kB
-> Subquery Scan on ms (cost=3.52..4.44 rows=131 width=36) (actual time=0.408..0.411 rows=1 loops=1)
-> HashAggregate (cost=3.52..4.05 rows=131 width=8) (actual time=0.406..0.408 rows=1 loops=1)
-> Seq Scan on meetings (cost=0.00..3.39 rows=131 width=8) (actual time=0.006..0.163 rows=161 loops=1)
-> Sort (cost=18.81..18.91 rows=199 width=36) (actual time=0.012..0.012 rows=0 loops=1)
Sort Key: c.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on c (cost=15.90..17.29 rows=199 width=36) (actual time=0.007..0.007 rows=0 loops=1)
-> HashAggregate (cost=15.90..16.70 rows=199 width=60) (actual time=0.006..0.006 rows=0 loops=1)
-> Seq Scan on comments (cost=0.00..12.22 rows=736 width=60) (actual time=0.004..0.004 rows=0 loops=1)
Filter: (email_id IS NOT NULL)
Rows Removed by Filter: 2
SubPlan 1
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=56) (never executed)
-> Materialize (cost=4220.14..4883.55 rows=27275 width=36) (actual time=247.720..1189.545 rows=29516 loops=1)
-> GroupAggregate (cost=4220.14..4788.09 rows=27275 width=15) (actual time=247.715..1131.787 rows=29516 loops=1)
-> Sort (cost=4220.14..4261.86 rows=83426 width=15) (actual time=247.634..339.376 rows=82632 loops=1)
Sort Key: public.email_participants.email_id
Sort Method: external sort Disk: 1760kB
-> Seq Scan on email_participants (cost=0.00..2856.28 rows=83426 width=15) (actual time=0.009..88.938 rows=82720 loops=1)
SubPlan 2
-> Values Scan on "*VALUES*" (cost=0.00..0.00 rows=1 width=40) (actual time=0.004..0.005 rows=1 loops=82631)
-> Sort (cost=2.01..2.01 rows=1 width=36) (actual time=0.074..0.077 rows=3 loops=1)
Sort Key: atts.email_id
Sort Method: quicksort Memory: 25kB
-> Subquery Scan on atts (cost=2.00..2.01 rows=1 width=36) (actual time=0.048..0.060 rows=3 loops=1)
-> HashAggregate (cost=2.00..2.01 rows=1 width=8) (actual time=0.045..0.051 rows=3 loops=1)
-> Seq Scan on attachments (cost=0.00..2.00 rows=1 width=8) (actual time=0.013..0.021 rows=5 loops=1)
-> Index Only Scan using email_participants_email_id_user_id_index on email_participants (cost=0.00..990.04 rows=269 width=4) (actual time=1.357..2.886 rows=43 loops=1)
Index Cond: (user_id = 75)
Heap Fetches: 43
总运行时间:1642.157 ms (75行)
答案 0 :(得分:1)
我绝对不是在寻找修复程序:)或重构查询。任何类型的高层建议都是最受欢迎的。
根据我的评论,问题的要点在于相互联系的聚合。这可以防止使用索引,并在查询计划中产生一堆合并连接(和实现)......
换句话说,将其视为一个如此复杂的计划,Postgres通过在内存中实现临时表,然后重复排序,直到它们全部合并为适当的方式进行处理。从我站立的地方来看,完整的hogwash似乎等于从所有表中选择所有行,以及它们所有可能的关系。一旦它被弄清楚并且变成了存在,Postgres继续对混乱进行排序以便提取前n行。
无论如何,你想要重写查询,以便它实际上可以使用索引开始。
部分原因很简单。例如,这是一个很大的禁忌:
select …,
ts.ids AS todo_ids, cs.ids AS call_ids,
ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people,
atts.ids AS attachment_ids
在一个查询中获取电子邮件。使用一口大小的email_id in (…)
子句在单独的查询中获取相关对象。仅仅这样做可以加快速度。
对于其余部分,它可能是也可能不简单,或者涉及对架构的一些重新设计。我只扫描了难以理解的怪物及其令人毛骨悚然的查询计划,所以我无法肯定地评论。
答案 1 :(得分:1)
我认为大观点不太可能表现良好,您应该将其分解为更易于管理的组件,但仍然会想到两个具体的建议:
架构更改
将text和html主体移出主表。 虽然大内容会自动存储在TOAST空间中,但邮件部分通常会小于TOAST阈值(~2000字节),特别是对于纯文本,因此不会系统地发生。
如果您认为表的主要用途是包含发件人,收件人,日期,主题等标题字段,则每个非TOASTED内容都会以对I / O性能和缓存有害的方式使表膨胀。 ..
我可以使用mail database中碰巧拥有的内容对此进行测试。在收件箱中的55k邮件样本中:
平均文本/普通大小:1511字节 平均文本/ html大小:11895字节(但42395条消息根本没有html)
没有机身的邮件表的大小:14Mb(无TOAST)
如果将主体添加为另外2个TEXT列,则主存储中为59Mb,TOAST中为61Mb。
尽管有TOAST,主存储似乎要大4倍。因此,在不需要TEXT列的情况下扫描表时,会浪费80%的I / O.未来的行更新可能会因碎片效应而变得更糟。
可以通过pg_statio_all_tables
视图发现块读取方面的效果(在查询之前和之后比较heap_blks_read + heap_blks_hit
)
<强>调整强>
EXPLAIN的这一部分
排序方法:外部排序磁盘:1760kB
表明您work_mem
太小了。你不想为这么小的类别打磁盘。除非你的可用内存不足,否则至少要10Mb。在您使用它时,如果它仍然是默认值,请将shared_buffers
设置为合理的值。有关详情,请参阅http://wiki.postgresql.org/wiki/Performance_Optimization。