I have the following view: http://pastebin.com/jgLeM3cd, and my database is about 10 GB in size. The problem is that, because of the DISTINCT, the view executes really slowly.
SELECT DISTINCT
users.id AS user_id,
contacts.id AS contact_id,
contact_types.name AS relationship,
channels.name AS channel,
feed_items.send_at AS sent_at,
feed_items.body AS message,
feed_items.from_id,
feed_items.feed_id
FROM feed_items
JOIN channels ON feed_items.channel_id = channels.id
JOIN feeds ON feed_items.feed_id = feeds.id
JOIN contacts ON feeds.contact_id = contacts.id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
WHERE contacts.is_fake = false;
For example, here it is with LIMIT 10 added: https://explain.depesz.com/s/K8q2
                                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7717200.06..7717200.28 rows=10 width=1113) (actual time=118656.704..118656.726 rows=10 loops=1)
   ->  Unique  (cost=7717200.06..7780174.02 rows=2798843 width=1113) (actual time=118656.702..118656.723 rows=10 loops=1)
         ->  Sort  (cost=7717200.06..7724197.16 rows=2798843 width=1113) (actual time=118656.700..118656.712 rows=10 loops=1)
               Sort Key: users.id, contacts.id, contact_types.name, channels.name, feed_items.send_at, feed_items.body, feed_items.from_id, feed_items.feed_id
               Sort Method: external merge  Disk: 589888kB
               ->  Hash Join  (cost=22677.02..577531.86 rows=2798843 width=1113) (actual time=416.072..12918.259 rows=5301453 loops=1)
                     Hash Cond: (feed_items.channel_id = channels.id)
                     ->  Hash Join  (cost=22675.84..539046.59 rows=2798843 width=601) (actual time=416.052..10703.796 rows=5301636 loops=1)
                           Hash Cond: (contacts.contact_type_id = contact_types.id)
                           ->  Hash Join  (cost=22674.73..500479.61 rows=2820650 width=89) (actual time=416.038..8494.439 rows=5303074 loops=1)
                                 Hash Cond: (feed_items.feed_id = feeds.id)
                                 ->  Seq Scan on feed_items  (cost=0.00..223787.54 rows=6828254 width=77) (actual time=0.025..2300.762 rows=6820169 loops=1)
                                 ->  Hash  (cost=18314.88..18314.88 rows=250788 width=16) (actual time=415.830..415.830 rows=268669 loops=1)
                                       Buckets: 4096  Batches: 16  Memory Usage: 806kB
                                       ->  Hash Join  (cost=1642.22..18314.88 rows=250788 width=16) (actual time=19.562..337.146 rows=268669 loops=1)
                                             Hash Cond: (feeds.contact_id = contacts.id)
                                             ->  Seq Scan on feeds  (cost=0.00..11888.11 rows=607111 width=8) (actual time=0.013..116.339 rows=607117 loops=1)
                                             ->  Hash  (cost=1517.99..1517.99 rows=9938 width=12) (actual time=19.537..19.537 rows=9945 loops=1)
                                                   Buckets: 1024  Batches: 1  Memory Usage: 427kB
                                                   ->  Hash Join  (cost=619.65..1517.99 rows=9938 width=12) (actual time=5.743..16.746 rows=9945 loops=1)
                                                         Hash Cond: (contacts.user_id = users.id)
                                                         ->  Seq Scan on contacts  (cost=0.00..699.58 rows=9938 width=12) (actual time=0.005..5.981 rows=9945 loops=1)
                                                               Filter: (NOT is_fake)
                                                               Rows Removed by Filter: 14120
                                                         ->  Hash  (cost=473.18..473.18 rows=11718 width=4) (actual time=5.728..5.728 rows=11800 loops=1)
                                                               Buckets: 2048  Batches: 1  Memory Usage: 415kB
                                                               ->  Seq Scan on users  (cost=0.00..473.18 rows=11718 width=4) (actual time=0.004..2.915 rows=11800 loops=1)
                           ->  Hash  (cost=1.05..1.05 rows=5 width=520) (actual time=0.004..0.004 rows=5 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                 ->  Seq Scan on contact_types  (cost=0.00..1.05 rows=5 width=520) (actual time=0.002..0.003 rows=5 loops=1)
                     ->  Hash  (cost=1.08..1.08 rows=8 width=520) (actual time=0.012..0.012 rows=8 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 1kB
                           ->  Seq Scan on channels  (cost=0.00..1.08 rows=8 width=520) (actual time=0.006..0.007 rows=8 loops=1)
 Total runtime: 118765.513 ms
(34 rows)
I have created b-tree indexes on almost every column except feed_items.body, since that is a text column. I also increased work_mem, but it didn't help; the change was along the lines of the sketch below. Any ideas on how to speed this up?
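A minimal sketch of that setting (the value here is illustrative, not the exact one I tried):

-- Session-level only. The plan above shows the sort spilling 589888 kB to
-- disk, so anything much below ~600 MB still ends up as an external merge.
SET work_mem = '256MB';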
Answer 0 (score: 0)
As others have said in the comments:

- Use DISTINCT over as few columns as possible (see the sketch after this list).
- Increasing work_mem may help, but it is not a definitive solution (your query is quite inefficient, and as the database grows it will degrade again...).
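A sketch of the first point (untested, and assuming feed_items has a primary-key column id, which the original query never shows): deduplicate on the narrow key columns only, then fetch the wide text column afterwards, so the sort behind DISTINCT never has to compare multi-kilobyte body values. This returns the same rows as the original as long as no two distinct feed_items rows carry identical values in every selected column.

SELECT dedup.user_id,
       dedup.contact_id,
       dedup.relationship,
       dedup.channel,
       dedup.sent_at,
       fi.body AS message,
       dedup.from_id,
       dedup.feed_id
FROM (
    -- DISTINCT over narrow keys only; feed_items.id (assumed to be the PK)
    -- stands in for the wide body column during deduplication.
    SELECT DISTINCT
        users.id AS user_id,
        contacts.id AS contact_id,
        contact_types.name AS relationship,
        channels.name AS channel,
        feed_items.send_at AS sent_at,
        feed_items.id AS feed_item_id,
        feed_items.from_id,
        feed_items.feed_id
    FROM feed_items
    JOIN channels ON feed_items.channel_id = channels.id
    JOIN feeds ON feed_items.feed_id = feeds.id
    JOIN contacts ON feeds.contact_id = contacts.id
    JOIN contact_types ON contacts.contact_type_id = contact_types.id
    JOIN users ON contacts.user_id = users.id
    WHERE contacts.is_fake = false
) dedup
JOIN feed_items fi ON fi.id = dedup.feed_item_id;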
Also:

- Indexes are of little help in big scan queries like this one: an index is faster for picking out specific rows, but a full scan over an index costs much more than a sequential scan over the table (or the join).
- The only exception is when you only need to select a few records from a big table. But the planner has a hard time guessing that, so you may need to force it by using a subquery or a CTE (a "WITH" clause); see the sketch after this list.
- Along the same lines as the work_mem increase: PostgreSQL 9.6 ships with parallel scans (which must be enabled manually first). If your server runs that version, or you have the chance to upgrade it, this can also shorten response times (even though, in any case, your query seems to need improvement... ;-)).
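A sketch combining both points (untested; before PostgreSQL 12 a CTE is an optimization fence, so the planner materializes the filtered contacts before the other joins ever see them, and max_parallel_workers_per_gather exists from 9.6 on, defaulting to 0 in that version):

-- Enable parallel scans (PostgreSQL 9.6+; the worker count is illustrative):
SET max_parallel_workers_per_gather = 4;

WITH real_contacts AS (
    -- Materialized first, so every later join only sees non-fake contacts.
    SELECT id, user_id, contact_type_id
    FROM contacts
    WHERE is_fake = false
)
SELECT DISTINCT
    users.id AS user_id,
    rc.id AS contact_id,
    contact_types.name AS relationship,
    channels.name AS channel,
    feed_items.send_at AS sent_at,
    feed_items.body AS message,
    feed_items.from_id,
    feed_items.feed_id
FROM real_contacts rc
JOIN feeds ON feeds.contact_id = rc.id
JOIN feed_items ON feed_items.feed_id = feeds.id
JOIN channels ON channels.id = feed_items.channel_id
JOIN contact_types ON contact_types.id = rc.contact_type_id
JOIN users ON users.id = rc.user_id;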
So my advice is to try to reduce the amount of data involved in the joins as much as possible, especially in the first join. That is: join order matters. Remember that (fortunately) you have no LEFT JOINs, so every join is effectively a potential filter, and starting from the smaller tables (or the ones from which you will select fewer rows) can drastically reduce the amount of memory the joins need.
For example (based on your query, completely untested, and bear in mind that your actual data distribution matters):
SELECT DISTINCT
users.id AS user_id,
contacts.id AS contact_id,
contact_types.name AS relationship,
channels.name AS channel,
feed_items.send_at AS sent_at,
feed_items.body AS message,
feed_items.from_id,
feed_items.feed_id
-- Base the query on contacts, because it is the only table where rows
-- are being discarded:
FROM contacts
JOIN feeds ON (
contacts.is_fake = false -- Filter here to reduce join size
and feeds.contact_id = contacts.id -- Actual join condition
)
JOIN feed_items ON feed_items.feed_id = feeds.id
JOIN channels ON channels.id = feed_items.channel_id
JOIN contact_types ON contacts.contact_type_id = contact_types.id
JOIN users ON contacts.user_id = users.id
;
But, again: everything depends on your actual data.
Try it, EXPLAIN ANALYZE it, identify the most expensive parts, and think about strategies to improve them.
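For example, running a candidate query under EXPLAIN (ANALYZE, BUFFERS) shows per-node timings plus buffer traffic (the column list here is shortened just to keep the example compact):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT users.id AS user_id, contacts.id AS contact_id
FROM contacts
JOIN users ON users.id = contacts.user_id
WHERE contacts.is_fake = false;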
These are just some loose ideas, but I hope they help you a bit.
Good luck!