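Get the latest child per parent from a big table - query is too slow

Asked: 2011-11-10 23:27:52

Tags: sql django performance postgresql aggregate-functions

I have a query generated by Django's ORM that takes hours to run.

The report_rank table (50 million rows) is in a one-to-many relation to report_profile (100k rows). I'm trying to retrieve the latest report_rank for each report_profile.

I'm running Postgres 9.1 on an extra large Amazon EC2 server with plenty of available RAM (2GB / 15GB used). Disk IO is pretty terrible, of course.

I have indexes on report_rank.created as well as on all foreign key fields.

What can I do to speed this query up? I'd be happy to try a different approach to the query if it performs well, or to tune any database configuration parameters as needed.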

EXPLAIN 
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created"
     , MAX(T7."created") AS "max" 
FROM "report_rank" 
LEFT OUTER JOIN "report_site" 
  ON ("report_rank"."site_id" = "report_site"."id") 
INNER JOIN "report_profile" 
  ON ("report_site"."id" = "report_profile"."site_id") 
INNER JOIN "crm_client" 
  ON ("report_profile"."client_id" = "crm_client"."id") 
INNER JOIN "auth_user" 
  ON ("crm_client"."user_id" = "auth_user"."id") 
LEFT OUTER JOIN "report_rank" T7 
  ON ("report_site"."id" = T7."site_id") 
WHERE ("auth_user"."is_active" = True  AND "crm_client"."is_deleted" = False ) 
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created" 
HAVING MAX(T7."created") =  "report_rank"."created";

Output of EXPLAIN:

GroupAggregate  (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
  Filter: (max(t7.created) = report_rank.created)
  ->  Sort  (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
        Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
        ->  Hash Join  (cost=1323766.36..6107863.59 rows=4658113881 width=72)
              Hash Cond: (report_rank.site_id = report_site.id)
              ->  Seq Scan on report_rank  (cost=0.00..1076119.27 rows=48133327 width=64)
              ->  Hash  (cost=1312601.51..1312601.51 rows=893188 width=16)
                    ->  Hash Right Join  (cost=47050.38..1312601.51 rows=893188 width=16)
                          Hash Cond: (t7.site_id = report_site.id)
                          ->  Seq Scan on report_rank t7  (cost=0.00..1076119.27 rows=48133327 width=12)
                          ->  Hash  (cost=46692.28..46692.28 rows=28648 width=8)
                                ->  Nested Loop  (cost=2201.98..46692.28 rows=28648 width=8)
                                      ->  Hash Join  (cost=2201.98..5733.23 rows=28648 width=4)
                                            Hash Cond: (crm_client.user_id = auth_user.id)
                                            ->  Hash Join  (cost=2040.73..5006.71 rows=44606 width=8)
                                                  Hash Cond: (report_profile.client_id = crm_client.id)
                                                  ->  Seq Scan on report_profile  (cost=0.00..1706.09 rows=93009 width=8)
                                                  ->  Hash  (cost=1761.98..1761.98 rows=22300 width=8)
                                                        ->  Seq Scan on crm_client  (cost=0.00..1761.98 rows=22300 width=8)
                                                              Filter: (NOT is_deleted)
                                            ->  Hash  (cost=126.85..126.85 rows=2752 width=4)
                                                  ->  Seq Scan on auth_user  (cost=0.00..126.85 rows=2752 width=4)
                                                        Filter: is_active
                                      ->  Index Scan using report_site_pkey on report_site  (cost=0.00..1.42 rows=1 width=4)
                                            Index Cond: (id = report_profile.site_id)

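3 answers:

Answer 0 (score: 7)

The major point is that you JOIN and GROUP over everything just to get max(created). Get this value separately.

You mentioned all the indexes that are needed here: on report_rank.created and on the foreign keys. You are doing alright there. (If you are interested in better than "alright", keep reading!)

The LEFT JOIN on report_site will be forced to a plain JOIN by the WHERE clause. I substituted a plain JOIN. I also simplified your syntax.

Updated July 2015 with simpler, faster queries and smarter functions.

Solution for multiple rows

report_rank.created is not unique and you want all the latest rows.
Use the window function rank() in a subquery: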

SELECT r.id, r.keyword_id, r.site_id
     , r.rank, r.url, r.competition
     , r.source, r.country, r.created  -- same as "max"
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id      = p.client_id
      JOIN   auth_user      u ON u.id      = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE
      )
   ) r
WHERE  rnk = 1;

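Why DESC NULLS LAST? In PostgreSQL, NULL values sort first in descending order by default, so if created can be NULL, such a row would otherwise be ranked as the latest.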
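To see how rank() keeps every row tied for the newest timestamp, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The trimmed-down table, the sample rows and the helper name latest_rows are assumptions made for illustration, the EXISTS filter is left out, and window functions require SQLite 3.25 or later. SQLite already sorts NULL last in descending order, so NULLS LAST is omitted here.

```python
import sqlite3


def latest_rows(rows):
    """Return all rows sharing the newest `created` value, using rank()."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE report_rank "
        "(id INTEGER PRIMARY KEY, site_id INTEGER, created TEXT)"
    )
    con.executemany("INSERT INTO report_rank VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT id, site_id, created
        FROM  (
           SELECT *, rank() OVER (ORDER BY created DESC) AS rnk
           FROM   report_rank
           ) sub
        WHERE  rnk = 1      -- ties for the newest timestamp all get rank 1
        ORDER  BY id
        """
    ).fetchall()
    con.close()
    return result


# Hypothetical sample data: rows 2 and 3 share the newest timestamp.
sample = [
    (1, 10, "2011-11-09 08:00:00"),
    (2, 10, "2011-11-10 23:00:00"),
    (3, 20, "2011-11-10 23:00:00"),
    (4, 20, "2011-11-01 12:00:00"),
]

print(latest_rows(sample))
```

With this sample, both rows 2 and 3 come back, which is the behaviour wanted when created is not unique.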

Solution for one row

If report_rank.created is unique, or you are happy with any one row that has max(created):

SELECT id, keyword_id, site_id
     , rank, url, competition
     , source, country, created  -- same as "max"
FROM   report_rank r
WHERE  EXISTS (
    SELECT 1
    FROM   report_site    s
    JOIN   report_profile p ON p.site_id = s.id
    JOIN   crm_client     c ON c.id      = p.client_id
    JOIN   auth_user      u ON u.id      = c.user_id
    WHERE  s.id = r.site_id
    AND    u.is_active
    AND    c.is_deleted = FALSE
   )
-- AND  r.created > f_report_rank_cap()
ORDER  BY r.created DESC NULLS LAST
LIMIT  1;

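This should be faster still. More options:

Ultimate speed with a dynamically adjusted partial index

You may have noticed the commented part in the last query: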

AND  r.created > f_report_rank_cap()

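You mention 50 million rows, which is a lot. Here is a way to speed things up:

  • Create a simple IMMUTABLE function returning a timestamp that is guaranteed to be older than the rows of interest while being as young as possible.
  • Create a partial index on the younger rows only, based on this function.
  • Use a WHERE condition in queries that matches the index condition.
  • Create another function that uses dynamic DDL to update these objects to the latest row. (Subtract a safety margin in case the newest rows get deleted or deactivated, if that can happen.)
  • Invoke this secondary function at off-hours with a minimum of concurrent activity, per cron job or on demand. Call it as often as you want; it does no harm and only needs a short exclusive lock on the table.

Here is a complete working demo. @erikcw, you will have to activate the commented part as instructed below.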

CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());

-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
  RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$;  -- or as high as you can safely bet.

-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE  created > f_report_rank_cap();

-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
  RETURNS void AS
$func$
DECLARE
   _secure_margin CONSTANT interval := interval '1 day';  -- adjust to your case
   _cap timestamp;  -- exclude older rows than this from partial index
BEGIN
   SELECT max(created) - _secure_margin
   FROM   report_rank r
   WHERE  created > f_report_rank_cap() + _secure_margin
   /*  not needed for the demo; @erikcw needs to activate this
   AND    EXISTS (
     SELECT *
     FROM   report_site    s
     JOIN   report_profile p ON p.site_id = s.id
     JOIN   crm_client     c ON c.id      = p.client_id
     JOIN   auth_user      u ON u.id      = c.user_id
     WHERE  s.id = r.site_id
     AND    u.is_active
     AND    c.is_deleted = FALSE)
   */
   INTO   _cap;

   IF FOUND THEN
     -- recreate function
     EXECUTE format('
     CREATE OR REPLACE FUNCTION f_report_rank_cap()
       RETURNS timestamp LANGUAGE sql IMMUTABLE AS
     $y$SELECT %L::timestamp$y$', _cap);

     -- reindex
     REINDEX INDEX report_rank_recent_idx;
   END IF;
END
$func$  LANGUAGE plpgsql;

COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
    and reindex partial index on report_rank.';

Call:

SELECT f_report_rank_set_cap();

Check:

SELECT f_report_rank_cap();

Uncomment the clause AND r.created > f_report_rank_cap() in the query above and observe the difference. Verify with EXPLAIN ANALYZE that the index gets used.

The manual on concurrency and REINDEX

  

To build the index without interfering with production you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.

Answer 1 (score: 1)

-- modelled after Erwin's version
-- does the x query really return only one row?

SELECT r.id, r.keyword_id, r.site_id
    , r.rank, r.url, r.competition, r.source
    , r.country, r.created, x.max_created
-- UPDATE3: I forgot one, too
FROM report_rank r
LEFT   JOIN report_site s  ON (r.site_id = s.id) 
JOIN   report_profile   p  ON (s.id = p.site_id) 
JOIN   crm_client       c  ON (p.client_id = c.id) 
JOIN   auth_user        u  ON (c.user_id = u.id)
-- UPDATE2: t7 has left the building
WHERE  u.is_active
AND    c.is_deleted = FALSE
AND NOT EXISTS (SELECT * FROM report_rank x
       -- WHERE 1=1 -- uncorrelated subquery ??
       -- UPDATE1: no it's not. Erwin seems to have forgotten the t7 join
       WHERE r.site_id = x.site_id
       AND x.created > r.created
       ) 
;
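The NOT EXISTS pattern above keeps a row only when no newer row exists for the same site. A minimal sketch of just that anti-join, run against an in-memory SQLite database from Python: the reduced table, the sample rows and the helper name latest_per_site are assumptions for illustration, the joins to the profile, client and user tables are omitted, and the correlation is on site_id.

```python
import sqlite3


def latest_per_site(rows):
    """Return, per site_id, the rows for which no newer row exists."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE report_rank "
        "(id INTEGER PRIMARY KEY, site_id INTEGER, created TEXT)"
    )
    con.executemany("INSERT INTO report_rank VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT r.id, r.site_id, r.created
        FROM   report_rank r
        WHERE  NOT EXISTS (
           SELECT 1
           FROM   report_rank x
           WHERE  x.site_id = r.site_id   -- same parent
           AND    x.created > r.created   -- but strictly newer
           )
        ORDER  BY r.id
        """
    ).fetchall()
    con.close()
    return result


# Hypothetical sample data: site 20 has two rows tied for the newest timestamp.
sample = [
    (1, 10, "2011-11-09 08:00:00"),
    (2, 10, "2011-11-10 23:00:00"),
    (3, 20, "2011-11-08 09:00:00"),
    (4, 20, "2011-11-08 09:00:00"),
    (5, 20, "2011-11-01 12:00:00"),
]

print(latest_per_site(sample))
```

Because the comparison is strict, tied rows (3 and 4 here) are both returned, so the result is not guaranteed to be one row per site.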

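Answer 2 (score: 0)

Alternative interpretation

I was busy optimizing the query you presented and missed part of what you wrote:

  

I'm trying to retrieve the latest report_rank for each report_profile.

That is quite different from what your query attempts.

First, let me demonstrate how I distilled the query from what you posted. I removed the "" and noise words, used aliases and trimmed the format, arriving at this: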

SELECT r.id, r.keyword_id, r.site_id, r.rank, r.url, r.competition
      ,r.source, r.country, r.created
      ,MAX(t7.created) AS max 
FROM   report_rank      r
LEFT   JOIN report_site s  ON (s.id      = r.site_id) 
JOIN   report_profile   p  ON (p.site_id = s.id) 
JOIN   crm_client       c  ON (c.id      = p.client_id) 
JOIN   auth_user        u  ON (u.id      = c.user_id) 
LEFT   JOIN report_rank t7 ON (t7.site_id = s.id) 
WHERE  u.is_active
AND    c.is_deleted = False
GROUP  BY
       r.id
      ,r.keyword_id
      ,r.site_id
      ,r.rank
      ,r.url, r.competition
      ,r.source
      ,r.country
      ,r.created 
HAVING MAX(t7.created) =  r.created;
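
  • What you try with T7 and HAVING cannot work on principle, so I pruned that.
  • LEFT JOIN will be forced to a plain JOIN in both cases. I substituted accordingly.
  • From your query I infer that report_site has a 1:n relation to both report_rank and report_profile, and that this is how those two are tied together. So all report_profile rows that belong to the same report_site share the same latest report_rank. You could just as well group by report_site, but I stick to the question asked.
  • I removed report_site from the query. It is irrelevant as long as it exists, which I assume.
  • As of PostgreSQL 9.1, the primary key of each table is sufficient for GROUP BY. I simplified accordingly.
  • To simplify things, I select all columns of report_rank.

With all that, I arrive at this basic query:

SELECT r.*
FROM   report_rank      r
JOIN   report_profile   p USING (site_id)
JOIN   crm_client       c ON (c.id = p.client_id)
JOIN   auth_user        u ON (u.id = c.user_id)
WHERE  u.is_active
AND    c.is_deleted = FALSE
GROUP  BY r.id;

Building on that, I create a solution for ...

The latest report_rank per report_profile

WITH p AS (
    SELECT p.id AS profile_id
          ,p.site_id
    FROM   report_profile p
    WHERE  EXISTS (
        SELECT *
        FROM   crm_client c
        JOIN   auth_user  u ON u.id = c.user_id
        WHERE  c.id = p.client_id
        AND    c.is_deleted = FALSE
        AND    u.is_active
        )
    )
, x AS (
    SELECT p.profile_id
          ,r.*
    FROM   p
    JOIN   report_rank r USING (site_id)
    )
SELECT *
FROM   x
WHERE  NOT EXISTS (
    SELECT *
    FROM   x r
    WHERE  r.profile_id = x.profile_id
    AND    r.created > x.created
    );

  • I assume that there is a report_profile.id, though you did not mention it.
  • In the first CTE I get a unique set of valid profiles.
  • In the second CTE I join with report_rank to produce the result rows.
  • In the final query I eliminate all but the latest report_rank per report_profile.
  • The result can be one or multiple rows per profile if created is not unique.
  • The solution with a partial index in my other answer is not applicable to this variant.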
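The two-CTE approach can be tried end to end on toy data. The sketch below uses Python's sqlite3 module in place of PostgreSQL; the reduced schemas, the sample rows, the 0/1 integers standing in for booleans and the helper name latest_rank_per_profile are all assumptions for illustration. Columns are listed explicitly instead of using r.*, and report_profile is aliased as pr to keep it apart from the CTE named p.

```python
import sqlite3

# Hypothetical, reduced schema and sample data.
SCHEMA = """
CREATE TABLE auth_user      (id INTEGER PRIMARY KEY, is_active INTEGER);
CREATE TABLE crm_client     (id INTEGER PRIMARY KEY, user_id INTEGER, is_deleted INTEGER);
CREATE TABLE report_profile (id INTEGER PRIMARY KEY, site_id INTEGER, client_id INTEGER);
CREATE TABLE report_rank    (id INTEGER PRIMARY KEY, site_id INTEGER, created TEXT);

INSERT INTO auth_user      VALUES (1, 1), (2, 0);
INSERT INTO crm_client     VALUES (1, 1, 0), (2, 2, 0), (3, 1, 1);
INSERT INTO report_profile VALUES (100, 10, 1), (101, 20, 1), (102, 10, 2), (103, 20, 3);
INSERT INTO report_rank    VALUES
    (1, 10, '2011-11-09 08:00:00'),
    (2, 10, '2011-11-10 23:00:00'),
    (3, 20, '2011-11-08 09:00:00'),
    (4, 20, '2011-11-07 09:00:00');
"""

QUERY = """
WITH p AS (                      -- valid profiles only
    SELECT pr.id AS profile_id, pr.site_id
    FROM   report_profile pr
    WHERE  EXISTS (
        SELECT 1
        FROM   crm_client c
        JOIN   auth_user  u ON u.id = c.user_id
        WHERE  c.id = pr.client_id
        AND    c.is_deleted = 0
        AND    u.is_active = 1
        )
    )
, x AS (                         -- every rank row per valid profile
    SELECT p.profile_id, r.id AS rank_id, r.created
    FROM   p
    JOIN   report_rank r ON r.site_id = p.site_id
    )
SELECT profile_id, rank_id, created
FROM   x
WHERE  NOT EXISTS (              -- keep rows with nothing newer
    SELECT 1
    FROM   x r
    WHERE  r.profile_id = x.profile_id
    AND    r.created > x.created
    )
ORDER  BY profile_id, rank_id
"""


def latest_rank_per_profile():
    """Run the CTE query on the sample data and return the result rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


print(latest_rank_per_profile())
```

Profiles 102 (inactive user) and 103 (deleted client) drop out in the first CTE, so only profiles 100 and 101 appear, each with its newest rank row.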

Finally, advice on performance optimization from the PostgreSQL wiki: