I'm seeing a 5x-10x slowdown when adding a single condition to the WHERE clause of a join. I've verified that the indexes are being used, and this is a very simple query with 2 joins:
This query takes 0.5 seconds:
EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses"
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=60600.79..60600.80 rows=1 width=0) (actual time=741.224..741.225 rows=1 loops=1)
   ->  Hash Join  (cost=321.63..60600.78 rows=2 width=0) (actual time=23.360..740.475 rows=795 loops=1)
         Hash Cond: (businesses.postal_code_id = postal_codes.id)
         ->  Nested Loop  (cost=184.63..60400.82 rows=16784 width=4) (actual time=19.200..690.901 rows=58076 loops=1)
               ->  Bitmap Heap Scan on categorizations  (cost=184.20..17662.46 rows=16784 width=4) (actual time=19.164..131.991 rows=58076 loops=1)
                     Recheck Cond: (category_id = ANY ('{958,968,936}'::integer[]))
                     ->  Bitmap Index Scan on categorizations_category_id  (cost=0.00..180.00 rows=16784 width=0) (actual time=9.994..9.994 rows=58076 loops=1)
                           Index Cond: (category_id = ANY ('{958,968,936}'::integer[]))
               ->  Index Scan using businesses_pkey on businesses  (cost=0.43..2.54 rows=1 width=8) (actual time=0.005..0.006 rows=1 loops=58076)
                     Index Cond: (id = categorizations.business_id)
         ->  Hash  (cost=135.49..135.49 rows=121 width=4) (actual time=0.449..0.449 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 6kB
               ->  Index Scan using idx_postal_codes_lower_city on postal_codes  (cost=0.43..135.49 rows=121 width=4) (actual time=0.037..0.312 rows=150 loops=1)
                     Index Cond: (lower((city)::text) = 'new york'::text)
 Total runtime: 741.321 ms
(15 rows)
But adding just one condition (region) pushes the average to 4 seconds:
EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses"
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york') AND lower(region) = 'new york';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1312.76..1312.77 rows=1 width=0) (actual time=2879.764..2879.765 rows=1 loops=1)
   ->  Nested Loop  (cost=16.77..1312.76 rows=1 width=0) (actual time=4.740..2878.936 rows=795 loops=1)
         ->  Nested Loop  (cost=16.21..1281.22 rows=18 width=4) (actual time=2.259..780.067 rows=206972 loops=1)
               ->  Index Scan using idx_postal_codes_city_region_country on postal_codes  (cost=0.43..2.65 rows=1 width=4) (actual time=0.052..0.432 rows=150 loops=1)
                     Index Cond: ((lower((city)::text) = 'new york'::text) AND (lower((region)::text) = 'new york'::text))
               ->  Bitmap Heap Scan on businesses  (cost=15.78..1267.29 rows=1128 width=8) (actual time=0.377..3.179 rows=1380 loops=150)
                     Recheck Cond: (postal_code_id = postal_codes.id)
                     ->  Bitmap Index Scan on index_businesses_on_postal_code_id  (cost=0.00..15.49 rows=1128 width=0) (actual time=0.219..0.219 rows=1380 loops=150)
                           Index Cond: (postal_code_id = postal_codes.id)
         ->  Index Only Scan using index_categorizations_on_business_id_and_category_id_and_source on categorizations  (cost=0.56..1.74 rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=206972)
               Index Cond: ((business_id = businesses.id) AND (category_id = ANY ('{958,968,936}'::integer[])))
               Heap Fetches: 2
 Total runtime: 2879.854 ms
(13 rows)
Note - I'm quoting averages rather than relying on the query times shown above, to rule out the possibility that caching is misleading me. While the tables are quite large (15 million businesses, 20 million categorizations, 1 million postal codes), I wouldn't expect a drastic change in performance from altering the postal_code condition. In fact, I would have thought it would be faster, since there is less to join against. Given how basic the query is, I'm unsure what tuning options are left to consider.
Below are the indexes on the postal_codes table. Note - I know they aren't all necessary. I'm experimenting at the moment, so I'll drop the superfluous ones once the query behaves properly.
\d postal_codes;
Table "public.postal_codes"
Column | Type | Modifiers
----------------+------------------------+-----------------------------------------------------------
id | integer | not null default nextval('postal_codes_id_seq'::regclass)
code | character varying(255) |
city | character varying(255) |
region | character varying(255) |
country | character varying(255) |
num_businesses | integer |
region_abbr | text |
Indexes:
"postal_codes_pkey" PRIMARY KEY, btree (id)
"idx_postal_codes_city_region_country" btree (lower(city::text), lower(region::text), country)
"idx_postal_codes_lower_city" btree (lower(city::text))
"idx_postal_codes_lower_region" btree (lower(region::text))
"idx_region_city_postal_codes" btree (lower(region::text), lower(city::text))
"index_postal_codes_on_code" btree (code)
Version and relevant tuning parameters (let me know if there are others I should look at):
server_version | 9.3.4
cpu_tuple_cost | 0.01
effective_cache_size | 16GB
maintenance_work_mem | 1GB
random_page_cost | 1.1
seq_page_cost | 1
shared_buffers | 8GB
work_mem | 1GB
I also have AUTOVACUUM turned on and have re-ANALYZEd businesses, categorizations and postal_codes (although I don't think that matters).
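For completeness, the re-analysis was just a plain ANALYZE of the three tables, along these lines:
-- re-collect planner statistics for the three tables involved
ANALYZE businesses;
ANALYZE categorizations;
ANALYZE postal_codes;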
Answer 0 (score: 1)
The real answer: normalize. In the original postal_codes table, far too much of the {country, region, city} information is duplicated. Squeeze this "domain" out into a separate "cities" table. For example:
-- temp schema for testing purposes.
DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
-- A copy of the original table
-- CREATE TABLE tmp.postal_codes_orig AS (SELECT * FROM public.postal_codes);
-- .. which I don't have ...
CREATE TABLE tmp.postal_codes_orig
( id SERIAL NOT NULL PRIMARY KEY
, code character varying(255) UNIQUE
, city character varying(255)
, region character varying(255)
, country character varying(255)
, num_businesses integer
, region_abbr text
);
-- some data to test it ...
INSERT INTO tmp.postal_codes_orig ( code , city , region , country , num_businesses , region_abbr ) VALUES
( '3500' , 'Utrecht' , 'Utrecht', 'Nederland', 1000, 'Ut' )
,( '3501' , 'Utrecht' , 'Utrecht', 'Nederland', 1001, 'UT' )
,( '3502' , 'Utrecht' , 'Utrecht', 'Nederland', 1002, 'Utr.' )
,( '3503' , 'Utrecht' , 'Utrecht', 'Nederland', 1003, 'Utr' )
,( '3504' , 'Utrecht' , 'Utrecht', 'Nederland', 1004, 'Ut.' )
;
-- normalisation: squeeze out "city" domain
CREATE TABLE tmp.cities
( id SERIAL NOT NULL PRIMARY KEY
, city character varying(255)
, region character varying(255) -- could be normalised out ...
, country character varying(255) -- could be normalised out ...
, region_abbr character varying(255)
, UNIQUE (city,region,country)
);
-- table with all original postal codes, but referring to the cities table instead of duplicating it
CREATE TABLE tmp.postal_codes_cities
( id SERIAL NOT NULL PRIMARY KEY
, code character varying(255) UNIQUE
, city_id INTEGER NOT NULL REFERENCES tmp.cities(id)
, num_businesses integer NOT NULL DEFAULT 0 -- this still depends on postal code, not on city
);
-- extract the unique cities domain
INSERT INTO tmp.cities(city,region,country,region_abbr)
SELECT DISTINCT city,region,country
, MIN(region_abbr)
FROM tmp.postal_codes_orig
GROUP BY city,region,country
;
CREATE INDEX ON tmp.cities (lower(city::text), lower(region::text), country);
CREATE INDEX ON tmp.cities (lower(city::text));
CREATE INDEX ON tmp.cities (lower(region::text));
CREATE INDEX ON tmp.cities (lower(region::text), lower(city::text));
-- populate the new postal codes table, retaining the 'stable' ids
INSERT INTO tmp.postal_codes_cities(id, code, city_id, num_businesses)
SELECT pc.id,pc.code, ci.id, pc.num_businesses
FROM tmp.postal_codes_orig pc
JOIN tmp.cities ci ON ci.city = pc.city
AND pc.region = ci.region
AND pc.country = ci.country
;
-- and don't forget to set the sequence (to the max id of the new postal codes table)
SELECT setval('postal_codes_cities_id_seq', MAX(pc.id)) FROM postal_codes_cities pc ;
-- convenience view mimicking the original table
CREATE VIEW tmp.postal_codes AS
SELECT pc.id AS id
, ci.city AS city
, ci.region AS region
, ci.country AS country
, pc.num_businesses AS num_businesses
, ci.region_abbr AS region_abbr
FROM tmp.postal_codes_cities pc
JOIN tmp.cities ci ON ci.id = pc.city_id
;
SELECT * FROM tmp.postal_codes;
Of course, the foreign keys in the other tables will have to be adjusted to point to the new postal_codes_cities.id. (Or to "code", which would be the natural key in my view.)
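For illustration only, re-pointing such a foreign key could look roughly like the sketch below; the constraint name businesses_postal_code_id_fkey is a guess, and since the new table keeps the original ids, the column values themselves need no rewrite:
-- sketch: re-point the existing FK on businesses to the new table
-- (constraint name is assumed; adjust to whatever \d businesses reports)
ALTER TABLE businesses
    DROP CONSTRAINT IF EXISTS businesses_postal_code_id_fkey;
ALTER TABLE businesses
    ADD CONSTRAINT businesses_postal_code_id_fkey
    FOREIGN KEY (postal_code_id) REFERENCES tmp.postal_codes_cities (id);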
BTW: with the normalized schema you won't even need the silly indexes on lower(region) and lower(city), because every name is stored only once, so you can force it into a canonical form.
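One possible way to read "force it into a canonical form" (a sketch, assuming you settle on initcap-style spellings; any other fixed convention works the same way):
-- sketch: enforce one canonical spelling per cities row, so plain equality
-- comparisons suffice and lower() expression indexes become unnecessary
ALTER TABLE tmp.cities
    ADD CONSTRAINT cities_city_canonical   CHECK (city   = initcap(city)),
    ADD CONSTRAINT cities_region_canonical CHECK (region = initcap(region));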
Answer 1 (score: 0)
The planner doesn't know that all of the 'new york' cities are in the 'new york' region; it assumes the two selectivities are independent and multiplies them together, which leads it to a bad plan.
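You can see this in the second plan above: the index scan on postal_codes is estimated at rows=1 but actually returns 150. The cross-column misestimate can be confirmed in isolation with something like:
-- estimate for the city predicate alone
EXPLAIN SELECT id FROM postal_codes WHERE lower(city) = 'new york';
-- estimate for city AND region together: the planner multiplies the two
-- selectivities, so the row estimate collapses even though the predicates
-- are almost perfectly correlated
EXPLAIN SELECT id FROM postal_codes
WHERE lower(city) = 'new york' AND lower(region) = 'new york';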
Answer 2 (score: 0)
The easiest way to keep the NY/NY leg of the query from being mis-planned is to put it into a CTE (CTEs are not broken up by the optimizer):
-- EXPLAIN ANALYZE
WITH ny AS (
SELECT pc.id
FROM postal_codes pc
WHERE lower(pc.city) = 'new york' AND lower(pc.region) = 'new york'
)
SELECT COUNT(*) FROM businesses bu
JOIN categorizations ca ON ca.business_id = bu.id
JOIN ny ON bu.postal_code_id = ny.id
WHERE ca.category_id IN (958,968,936)
;