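I have a database with a few tables, each with a few million rows (the tables do have indexes). I need to count the rows in one table, but only those whose foreign key field points to a subset of another table. Here is the query: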
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE ( top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )) ))
SELECT Count(*)
FROM cast_info
WHERE EXISTS(SELECT 1
FROM filtered_title
WHERE cast_info.movie_id = filtered_title.id)
AND cast_info.role_id IN( 3, 8 )
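I use a CTE because there are more COUNT queries further down for other tables, which use the same subqueries. But I tried getting rid of the CTE and the result was the same: the first time I execute the query it runs and runs for more than ten minutes. The second time I execute it, it takes only 4 seconds, which is acceptable for me.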
The result of EXPLAIN ANALYZE:
Aggregate (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
CTE filtered_title
-> Seq Scan on title top (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
Rows Removed by Filter: 2832906
SubPlan 1
-> Index Scan using title_idx_epof on title sub (cost=0.43..16.16 rows=1 width=0) (never executed)
Index Cond: (episode_of_id = top.id)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
SubPlan 2
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
-> Nested Loop (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
-> HashAggregate (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
-> CTE Scan on filtered_title (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
-> Index Scan using cast_info_idx_mid on cast_info (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
Index Cond: (movie_id = filtered_title.id)
Filter: (role_id = ANY ('{3,8}'::integer[]))
Rows Removed by Filter: 15
Total runtime: 127729.100 ms
Now to my question: what am I doing wrong, and how can I fix it?
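I tried a few variants of the same query: joins only, and join/EXISTS combinations. On the one hand, the variant below seems to need the least time to do the job (10x faster), but it still takes 60 seconds on average. On the other hand, unlike my first query, which needs 4-6 seconds on the second run, this one always needs 60 seconds.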
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )))
SELECT Count(*)
FROM cast_info
join filtered_title
ON cast_info.movie_id = filtered_title.id
WHERE cast_info.role_id IN( 3, 8 )
Answer 0 (score: 4)
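Disclaimer: there are too many factors in play to give a conclusive answer. The information "with a few tables, each has a few millions rows (tables do have indexes)" just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.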
I am going to address the most prominent detail (for me) in your query plan:
Sequential scan on title?
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
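The details to note are the Seq Scan, rows=16250 and Rows Removed by Filter: 2832906. Sequentially scanning almost 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer: subsequent calls can read the data from cache. Since the table is big, the data probably won't stay in cache for long unless you have huge amounts of cache.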
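To see how much of a run is served from cache, the BUFFERS option of EXPLAIN can help. This is a minimal sketch that reuses the filter from the question; it only reports counts and does not change the plan:

```sql
-- BUFFERS adds per-node counts: "shared hit" = pages found in Postgres' shared buffers,
-- "read" = pages that had to be fetched from outside shared buffers (OS cache or disk).
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   title
WHERE  production_year BETWEEN 1982 AND 1984
AND    kind_id IN (1, 2);
```

A first run dominated by "read" and a second run dominated by "shared hit" would be consistent with the caching explanation.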
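An index scan is typically substantially faster for gathering 0.5% of the rows from a big table. There are several possible causes for the sequential scan.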
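One way to narrow this down, as a sketch, is to check which indexes actually exist on title and when the planner statistics were last refreshed. The catalog views used here are standard Postgres; what they return for this database is unknown:

```sql
-- Which indexes exist on the table?
SELECT indexname, indexdef
FROM   pg_indexes
WHERE  tablename = 'title';

-- When were planner statistics last gathered (manually or by autovacuum)?
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'title';

-- Refresh the statistics manually if they look stale.
ANALYZE title;
```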
My money is on the index. You did not supply your version of Postgres, so I am assuming the current 9.3. The perfect index for this query would be:
CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)
Data types matter. The order of columns in the index matters.
kind_id comes first, because the rule of thumb is: index for equality first, then for ranges.
The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If that is not applicable, drop them. More details:
PostgreSQL composite primary key
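If index-only scans turn out not to apply, the narrower variant could look like this. The index name is hypothetical, and the column order follows the equality-before-range rule of thumb:

```sql
-- Hypothetical name; same leading columns as title_foo_idx,
-- without the two trailing columns that only serve index-only scans.
CREATE INDEX title_kind_year_idx ON title (kind_id, production_year);
```

A smaller index is cheaper to maintain and more likely to stay in cache.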
The way you built your query, you end up with two sequential scans on the big table. So here is an educated guess at a faster query:
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
, t_all AS (
SELECT id FROM t_base
UNION -- not UNION ALL (!)
SELECT id
FROM (SELECT DISTINCT episode_of_id AS id FROM t_base) x
JOIN title t USING (id)
)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_all t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
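This should give you one index scan on the new title_foo_idx, plus another index scan on the pk index of title. The rest should be relatively cheap. With any luck, it will be much faster than before.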
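Whether the planner actually picks the new index can be checked by running plain EXPLAIN on the t_base part alone. This is a sketch; the resulting plan depends on your data, statistics and settings:

```sql
-- Look for an index scan, index-only scan or bitmap scan on title_foo_idx
-- in the output, as opposed to "Seq Scan on title".
EXPLAIN
SELECT id, episode_of_id
FROM   title
WHERE  kind_id BETWEEN 1 AND 2
AND    production_year BETWEEN 1982 AND 1984;
```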
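kind_id BETWEEN 1 AND 2 .. as long as you have a continuous range of values, this is faster than listing individual values, because Postgres can then fetch a continuous range from the index. It is not very important for just two values.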
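For an integer column such as kind_id, the two spellings select the same rows, so the rewrite only applies when the listed values are contiguous. A small illustration (the value 5 is made up for the example):

```sql
-- Equivalent for an integer column: a contiguous list and a range.
SELECT count(*) FROM title WHERE kind_id IN (1, 2);
SELECT count(*) FROM title WHERE kind_id BETWEEN 1 AND 2;

-- Not expressible as a single range: the values are not contiguous.
SELECT count(*) FROM title WHERE kind_id IN (1, 2, 5);
```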
Test this alternative for the second leg of t_all. I am not sure which is faster:
SELECT id
FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
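Plugged into the full query, the EXISTS alternative would look roughly like this. It is assembled from the pieces above and untested against the real data:

```sql
WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984
   )
, t_all AS (
   SELECT id FROM t_base
   UNION   -- not UNION ALL: a title can qualify through both legs
   SELECT id
   FROM   title t
   WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
   )
SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_all t ON t.id = c.movie_id
WHERE  c.role_id IN (3, 8);
```

Comparing EXPLAIN ANALYZE output for both variants on a warm cache would show which leg is cheaper.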
You wrote:
"I use a CTE because there are more COUNT queries further down for other tables, which use the same subqueries."
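A CTE acts as an optimization barrier, and the resulting internal work table is not indexed. When a result with more than a trivial number of rows is reused multiple times, it pays to use an indexed temp table instead. Creating an index on a simple int column is fast.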
CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
SELECT id FROM t_base
UNION
SELECT id FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);
ANALYZE t_tmp; -- !
CREATE UNIQUE INDEX ON t_tmp (id); -- ! (unique is optional)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_tmp t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
-- More queries using t_tmp
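A further count against the same temp table might look like the sketch below. The table movie_info and its columns movie_id and info_type_id are assumptions for illustration, since the question does not name the other tables:

```sql
-- Hypothetical second table: movie_info(movie_id, info_type_id) is an assumption.
SELECT count(*) AS ct
FROM   movie_info m
JOIN   t_tmp t ON t.id = m.movie_id
WHERE  m.info_type_id IN (1, 2);

-- Optional cleanup; temp tables are dropped automatically at the end of the session.
DROP TABLE t_tmp;
```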