Why is this Postgres query so slow, and how can I make it faster?

Time: 2019-09-28 04:51:03

Tags: sql postgresql indexing query-performance

Here is the query:

(SELECT * 
FROM url 
WHERE domain = 'youtube.com' 
  AND timestamp > NOW() - INTERVAL '24 hours' 
ORDER BY likes DESC LIMIT 10) 
UNION 
(SELECT * 
FROM url 
WHERE domain = 'twitter.com' 
  AND timestamp > NOW() - INTERVAL '24 hours' 
ORDER BY likes DESC LIMIT 10) 
UNION 
(SELECT * 
FROM url 
WHERE domain = 'reddit.com' 
  AND timestamp > NOW() - INTERVAL '24 hours' 
ORDER BY likes DESC LIMIT 10) 
ORDER BY timestamp DESC

And here is the EXPLAIN ANALYZE output:

Sort  (cost=20460.17..20460.25 rows=30 width=497) (actual time=5161.013..5161.015 rows=30 loops=1)
  Sort Key: url."timestamp" DESC
  Sort Method: quicksort  Memory: 53kB
  ->  HashAggregate  (cost=20459.14..20459.44 rows=30 width=497) (actual time=5160.709..5160.738 rows=30 loops=1)
        Group Key: url.url, url.domain, url.title, url.views, url.likes, url.dislikes, url.comments, url.shares, url.links_to_url, url."user", url.thumbnail_url, url.is_collection, url.image_url, url.video_url, url.audio_url, url.width, url.height, url.body, url.source, url."timestamp", url.created_at, url.updated_at, url.duration_seconds, url.tags, url.channel
        ->  Append  (cost=0.43..20457.26 rows=30 width=497) (actual time=0.514..5160.073 rows=30 loops=1)
              ->  Limit  (cost=0.43..18150.71 rows=10 width=1177) (actual time=0.513..28.599 rows=10 loops=1)
                    ->  Index Scan Backward using "url-likes-index" on url  (cost=0.43..816763.00 rows=450 width=1177) (actual time=0.511..28.594 rows=10 loops=1)
                          Filter: (((domain)::text = 'youtube.com'::text) AND ("timestamp" > (now() - '24:00:00'::interval)))
                          Rows Removed by Filter: 11106
              ->  Limit  (cost=0.43..859.82 rows=10 width=1177) (actual time=2330.390..5033.214 rows=10 loops=1)
                    ->  Index Scan Backward using "url-likes-index" on url url_1  (cost=0.43..816763.00 rows=9504 width=1177) (actual time=2330.388..5033.200 rows=10 loops=1)
                          Filter: (((domain)::text = 'twitter.com'::text) AND ("timestamp" > (now() - '24:00:00'::interval)))
                          Rows Removed by Filter: 1667422
              ->  Limit  (cost=0.43..1446.28 rows=10 width=1177) (actual time=64.748..98.228 rows=10 loops=1)
                    ->  Index Scan Backward using "url-likes-index" on url url_2  (cost=0.43..816763.00 rows=5649 width=1177) (actual time=64.745..98.220 rows=10 loops=1)
                          Filter: (((domain)::text = 'reddit.com'::text) AND ("timestamp" > (now() - '24:00:00'::interval)))
                          Rows Removed by Filter: 26739
Planning Time: 3.006 ms
Execution Time: 5162.201 ms

If you are interested in running it yourself, go to this link.

I can see that over a million Twitter rows are being filtered out, but I'm not sure how to avoid that. I have an index on timestamp and was hoping the planner would use it instead of sorting on likes and scanning through everything. Does this mean I need a composite index? Is there a way to get the planner to use both indexes rather than just one of them?
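
For reference, the indexes involved look roughly like this; "url-likes-index" is the name shown in the plan, while the name of the timestamp index and the exact column lists are approximate:

CREATE INDEX "url-likes-index" ON url (likes);
CREATE INDEX "url-timestamp-index" ON url ("timestamp");  -- name is approximate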

P.S. I suspect I'm being bitten by having the url as the primary key; it makes the indexes unnecessarily large.

2 answers:

Answer 0 (score: 2)

PostgreSQL tries to use the index on likes to avoid sorting for the top 10 results, but it has to discard a lot of rows along the way.

Maybe this execution plan is the best one available, maybe not.

Try the following steps, in order (a rough SQL sketch of them follows the list):

  1. Run ANALYZE on your table and see whether that fixes the problem.

  2. If not, create an index on (domain, timestamp) (in that order!) and see whether that improves things.

  3. If that is still not enough, either

    • drop the index on likes (if you can), or

    • change ORDER BY likes to ORDER BY likes + 0
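
In SQL terms, roughly the following; the index name in step 2 is just an example:

-- Step 1: refresh the planner's statistics for the table
ANALYZE url;

-- Step 2: composite index with domain first, then timestamp
CREATE INDEX url_domain_timestamp_idx ON url (domain, "timestamp");

-- Step 3, second variant: make the sort key an expression so the planner
-- cannot use "url-likes-index" to produce the ordering, e.g. for one branch:
SELECT *
FROM url
WHERE domain = 'youtube.com'
  AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY likes + 0 DESC
LIMIT 10;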

If none of that improves things, your original query plan was already the best one, and all you can do is add more RAM in the hope that more of the data stays cached.

Answer 1 (score: 1)

I would suggest writing the query like this:

SELECT ufiltered.*
FROM (SELECT url.*,
            ROW_NUMBER() OVER (PARTITION BY domain ORDER BY likes DESC) AS seqnum
      FROM url 
      WHERE domain IN ('youtube.com', 'twitter.com', 'reddit.com') AND
            timestamp > NOW() - INTERVAL '24 hours'
    ) AS ufiltered
WHERE seqnum <= 10
ORDER BY timestamp DESC

For this, I would recommend an index on url(timestamp, domain, likes).
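
Something along these lines (the index name is only an example):

CREATE INDEX url_ts_domain_likes_idx ON url ("timestamp", domain, likes);

The idea is that the range condition on timestamp matches the leading column, and domain and likes are then available in the index entries for the filter and the per-domain ranking.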