我有一个由Django的ORM生成的查询,这需要花费数小时才能运行。
report_rank
表(5000万行)与report_profile
(100k行)的一对多关系。我正在尝试为每个report_rank
检索最新的report_profile
。
我在一台额外的大型Amazon EC2服务器上运行Postgres 9.1,该服务器有足够的可用内存(使用2GB / 15GB)。磁盘IO当然非常糟糕。
我在report_rank.created
以及所有外键字段都有索引。
如何加快查询速度?我很乐意尝试使用不同的查询方法,如果它将具有高性能,或者调整所需的任何数据库配置参数。
EXPLAIN
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
, "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
, "report_rank"."source", "report_rank"."country", "report_rank"."created"
, MAX(T7."created") AS "max"
FROM "report_rank"
LEFT OUTER JOIN "report_site"
ON ("report_rank"."site_id" = "report_site"."id")
INNER JOIN "report_profile"
ON ("report_site"."id" = "report_profile"."site_id")
INNER JOIN "crm_client"
ON ("report_profile"."client_id" = "crm_client"."id")
INNER JOIN "auth_user"
ON ("crm_client"."user_id" = "auth_user"."id")
LEFT OUTER JOIN "report_rank" T7
ON ("report_site"."id" = T7."site_id")
WHERE ("auth_user"."is_active" = True AND "crm_client"."is_deleted" = False )
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
, "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
, "report_rank"."source", "report_rank"."country", "report_rank"."created"
HAVING MAX(T7."created") = "report_rank"."created";
EXPLAIN
的输出:
GroupAggregate (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
Filter: (max(t7.created) = report_rank.created)
-> Sort (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
-> Hash Join (cost=1323766.36..6107863.59 rows=4658113881 width=72)
Hash Cond: (report_rank.site_id = report_site.id)
-> Seq Scan on report_rank (cost=0.00..1076119.27 rows=48133327 width=64)
-> Hash (cost=1312601.51..1312601.51 rows=893188 width=16)
-> Hash Right Join (cost=47050.38..1312601.51 rows=893188 width=16)
Hash Cond: (t7.site_id = report_site.id)
-> Seq Scan on report_rank t7 (cost=0.00..1076119.27 rows=48133327 width=12)
-> Hash (cost=46692.28..46692.28 rows=28648 width=8)
-> Nested Loop (cost=2201.98..46692.28 rows=28648 width=8)
-> Hash Join (cost=2201.98..5733.23 rows=28648 width=4)
Hash Cond: (crm_client.user_id = auth_user.id)
-> Hash Join (cost=2040.73..5006.71 rows=44606 width=8)
Hash Cond: (report_profile.client_id = crm_client.id)
-> Seq Scan on report_profile (cost=0.00..1706.09 rows=93009 width=8)
-> Hash (cost=1761.98..1761.98 rows=22300 width=8)
-> Seq Scan on crm_client (cost=0.00..1761.98 rows=22300 width=8)
Filter: (NOT is_deleted)
-> Hash (cost=126.85..126.85 rows=2752 width=4)
-> Seq Scan on auth_user (cost=0.00..126.85 rows=2752 width=4)
Filter: is_active
-> Index Scan using report_site_pkey on report_site (cost=0.00..1.42 rows=1 width=4)
Index Cond: (id = report_profile.site_id)
答案 0 :(得分:7)
最重要的一点是,您JOIN
和GROUP
只能得到max(created)
。单独获取此值。
您提到了此处所需的所有索引:在report_rank.created
和外键上。你在那里做得很好。 (如果你对“好”感兴趣,继续阅读!)
LEFT JOIN report_site
条款将强制JOIN
为WHERE
。我用一个简单的JOIN
代替了。我也简化了你的语法。
2015年7月更新,提供更简单,更快速的查询和更智能的功能。
report_rank.created
并非唯一,您想要所有最新的行。
在子查询中使用窗口函数rank()
。
SELECT r.id, r.keyword_id, r.site_id
, r.rank, r.url, r.competition
, r.source, r.country, r.created -- same as "max"
FROM (
SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
FROM report_rank r
WHERE EXISTS (
SELECT *
FROM report_site s
JOIN report_profile p ON p.site_id = s.id
JOIN crm_client c ON c.id = p.client_id
JOIN auth_user u ON u.id = c.user_id
WHERE s.id = r.site_id
AND u.is_active
AND c.is_deleted = FALSE
)
) sub
WHERE rnk = 1;
为什么DESC NULLS LAST
?
如果report_rank.created
唯一,或者您对任意一行 max(created)
感到满意:
SELECT id, keyword_id, site_id
, rank, url, competition
, source, country, created -- same as "max"
FROM report_rank r
WHERE EXISTS (
SELECT 1
FROM report_site s
JOIN report_profile p ON p.site_id = s.id
JOIN crm_client c ON c.id = p.client_id
JOIN auth_user u ON u.id = c.user_id
WHERE s.id = r.site_id
AND u.is_active
AND c.is_deleted = FALSE
)
-- AND r.created > f_report_rank_cap()
ORDER BY r.created DESC NULLS LAST
LIMIT 1;
应该更快,仍然。更多选择:
您可能已经注意到上一个查询中的注释部分:
AND r.created > f_report_rank_cap()
你提到50 mio。行,这很多。这是一种加快速度的方法:
IMMUTABLE
函数,返回一个时间戳,该时间戳保证比感兴趣的行更早,同时尽可能年轻。 WHERE
条件。这是完整的工作演示 @erikcw,您必须按照以下说明激活注释部分。
CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());
-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$; -- or as high as you can safely bet.
-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE created > f_report_rank_cap();
-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
RETURNS void AS
$func$
DECLARE
_secure_margin CONSTANT interval := interval '1 day'; -- adjust to your case
_cap timestamp; -- exclude older rows than this from partial index
BEGIN
SELECT max(created) - _secure_margin
FROM report_rank
WHERE created > f_report_rank_cap() + _secure_margin
/* not needed for the demo; @erikcw needs to activate this
AND EXISTS (
SELECT *
FROM report_site s
JOIN report_profile p ON p.site_id = s.id
JOIN crm_client c ON c.id = p.client_id
JOIN auth_user u ON u.id = c.user_id
WHERE s.id = r.site_id
AND u.is_active
AND c.is_deleted = FALSE)
*/
INTO _cap;
IF FOUND THEN
-- recreate function
EXECUTE format('
CREATE OR REPLACE FUNCTION f_report_rank_cap()
RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$y$SELECT %L::timestamp$y$', _cap);
-- reindex
REINDEX INDEX report_rank_recent_idx;
END IF;
END
$func$ LANGUAGE plpgsql;
COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
and reindex partial index on report_rank.';
呼叫:
SELECT f_report_rank_set_cap();
请参阅:
SELECT f_report_rank_cap();
在上面的查询中取消注释AND r.created > f_report_rank_cap()
条款并观察其差异。验证索引是否与EXPLAIN ANALYZE
一起使用。
The manual on concurrency and REINDEX
:
要在不干扰生产的情况下构建索引,您应该删除索引并重新发出
CREATE INDEX CONCURRENTLY
命令。
答案 1 :(得分:1)
-- modelled after Erwin's version
-- does the x query really return only one row?
SELECT r.id, r.keyword_id, r.site_id
, r.rank, r.url, r.competition, r.source
, r.country, r.created, x.max_created
-- UPDATE3: I forgot one, too
FROM report_rank r
LEFT JOIN report_site s ON (r.site_id = s.id)
JOIN report_profile p ON (s.id = p.site_id)
JOIN crm_client c ON (p.client_id = c.id)
JOIN auth_user u ON (c.user_id = u.id)
-- UPDATE2: t7 has left the building
WHERE u.is_active
AND c.is_deleted = FALSE
AND NOT EXISTS (SELECT * FROM report_rank x
-- WHERE 1=1 -- uncorrelated subquery ??
-- UPDATE1: no it's not. Erwin seems to have forgotten the t7 join
WHERE r.id = x.site_id
AND x.created > r.created
)
;
答案 2 :(得分:0)
我正在忙着优化你提出的查询并错过了你所写的内容:
我正在尝试为每个report_profile检索最新的report_rank。
与您的查询尝试完全不同的。
首先,让我演示一下如何从您发布的内容中提取查询
我删除了""
和干扰词,使用了别名并修剪了格式,到达此处:
SELECT r.id, r.keyword_id, r.site_id, r.rank, r.url, r.competition
,r.source, r.country, r.created
,MAX(t7.created) AS max
FROM report_rank r
LEFT JOIN report_site s ON (s.id = r.site_id)
JOIN report_profile p ON (p.site_id = s.id)
JOIN crm_client c ON (c.id = p.client_id)
JOIN auth_user u ON (u.id = c.user_id)
LEFT JOIN report_rank t7 ON (t.site_id = s.id)
WHERE u.is_active
AND c.is_deleted = False
GROUP BY
r.id
,r.keyword_id
,r.site_id
,r.rank
,r.url, r.competition
,r.source
,r.country
,r.created
HAVING MAX(t7.created) = r.created;
T7
和HAVING
无法对校长工作,我对此进行了修剪。LEFT JOIN
将被强制为普通JOIN
。我做了相应的替换。report_site
与report_rank
和report_profile
的关系是1:n,这就是这两者的关联方式。因此,属于同一report_profile
的{{1}}共享相同的最新report_site
。您也可以按report_rank
分组。但我坚持提出的问题。report_site
。它是无关紧要的,只要它存在,我断言。report_site
了。我相应地简化了。GROUP BY
尽管如此,我还是来到了基本查询:
report_rank
在此基础上,我用...创建了一个解决方案。
SELECT r.*
FROM report_rank r
JOIN report_profile p USING (site_id)
JOIN crm_client c ON (c.id = p.client_id)
JOIN auth_user u ON (u.id = c.user_id)
WHERE u.is_active
AND c.is_deleted = FALSE
GROUP BY r.id;
report_rank
report_profile
WITH p AS (
SELECT p.id AS profile_id
,p.site_id
FROM report_profile p
WHERE EXISTS (
SELECT *
FROM crm_client c
JOIN auth_user u ON u.id = c.user_id
WHERE c.id = p.client_id
AND c.is_deleted = FALSE
AND u.is_active
)
) x AS (
SELECT p.profile_id
,r.*
FROM p
JOIN report_rank r USING (site_id)
)
SELECT *
FROM x
WHERE NOT EXISTS (
SELECT *
FROM x r
WHERE r.profile_id = x.profile_id
AND r.created > x.created
);
,但你没有提到它。report_profile.id
加入以生成结果行report_rank
以外的所有report_rank
report_profile
不唯一,可以是一行或多行。最后,来自PostgreSQL wiki的性能优化建议: