Minor Postgres query tweak causes a strange performance penalty

Time: 2014-05-08 21:04:09

Tags: postgresql query-optimization database-normalization denormalization

When adding one condition to the WHERE clause of a join, I ran into a 5x-10x slowdown. I've verified the indexes are being used, and it's a very simple query with 2 joins:

This query takes 0.5 seconds:

EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses" 
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york');


                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=60600.79..60600.80 rows=1 width=0) (actual time=741.224..741.225 rows=1 loops=1)
   ->  Hash Join  (cost=321.63..60600.78 rows=2 width=0) (actual time=23.360..740.475 rows=795 loops=1)
         Hash Cond: (businesses.postal_code_id = postal_codes.id)
         ->  Nested Loop  (cost=184.63..60400.82 rows=16784 width=4) (actual time=19.200..690.901 rows=58076 loops=1)
               ->  Bitmap Heap Scan on categorizations  (cost=184.20..17662.46 rows=16784 width=4) (actual time=19.164..131.991 rows=58076 loops=1)
                     Recheck Cond: (category_id = ANY ('{958,968,936}'::integer[]))
                     ->  Bitmap Index Scan on categorizations_category_id  (cost=0.00..180.00 rows=16784 width=0) (actual time=9.994..9.994 rows=58076 loops=1)
                           Index Cond: (category_id = ANY ('{958,968,936}'::integer[]))
               ->  Index Scan using businesses_pkey on businesses  (cost=0.43..2.54 rows=1 width=8) (actual time=0.005..0.006 rows=1 loops=58076)
                     Index Cond: (id = categorizations.business_id)
         ->  Hash  (cost=135.49..135.49 rows=121 width=4) (actual time=0.449..0.449 rows=150 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 6kB
               ->  Index Scan using idx_postal_codes_lower_city on postal_codes  (cost=0.43..135.49 rows=121 width=4) (actual time=0.037..0.312 rows=150 loops=1)
                     Index Cond: (lower((city)::text) = 'new york'::text)
 Total runtime: 741.321 ms
(15 rows)

But adding just one condition (region) pushes the average up to 4 seconds:

EXPLAIN ANALYZE SELECT COUNT(*) FROM "businesses" 
INNER JOIN categorizations ON categorizations.business_id = businesses.id
INNER JOIN postal_codes ON businesses.postal_code_id = postal_codes.id
WHERE categorizations.category_id IN (958,968,936)
AND lower(city) IN ('new york') AND lower(region) = 'new york';

                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1312.76..1312.77 rows=1 width=0) (actual time=2879.764..2879.765 rows=1 loops=1)
   ->  Nested Loop  (cost=16.77..1312.76 rows=1 width=0) (actual time=4.740..2878.936 rows=795 loops=1)
         ->  Nested Loop  (cost=16.21..1281.22 rows=18 width=4) (actual time=2.259..780.067 rows=206972 loops=1)
               ->  Index Scan using idx_postal_codes_city_region_country on postal_codes  (cost=0.43..2.65 rows=1 width=4) (actual time=0.052..0.432 rows=150 loops=1)
                     Index Cond: ((lower((city)::text) = 'new york'::text) AND (lower((region)::text) = 'new york'::text))
               ->  Bitmap Heap Scan on businesses  (cost=15.78..1267.29 rows=1128 width=8) (actual time=0.377..3.179 rows=1380 loops=150)
                     Recheck Cond: (postal_code_id = postal_codes.id)
                     ->  Bitmap Index Scan on index_businesses_on_postal_code_id  (cost=0.00..15.49 rows=1128 width=0) (actual time=0.219..0.219 rows=1380 loops=150)
                           Index Cond: (postal_code_id = postal_codes.id)
         ->  Index Only Scan using index_categorizations_on_business_id_and_category_id_and_source on categorizations  (cost=0.56..1.74 rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=206972)
               Index Cond: ((business_id = businesses.id) AND (category_id = ANY ('{958,968,936}'::integer[])))
               Heap Fetches: 2
 Total runtime: 2879.854 ms
(13 rows)

Note - I'm quoting averages rather than single runs, so that caching can't mislead me. The tables are very large (15 million businesses, 20 million categorizations, 1 million postal codes), but I wouldn't expect performance to change drastically just from altering the postal_code condition. If anything, I expected it to be faster, since there would be fewer rows to join against. Given how basic the query is, I'm hesitant about the tuning options.

Here are the indexes on the postal_codes table. Note - I know not all of them are necessary. I'm experimenting at the moment, and will drop the unnecessary ones once the query behaves properly.

\d postal_codes;
                                     Table "public.postal_codes"
     Column     |          Type          |                         Modifiers                         
----------------+------------------------+-----------------------------------------------------------
 id             | integer                | not null default nextval('postal_codes_id_seq'::regclass)
 code           | character varying(255) | 
 city           | character varying(255) | 
 region         | character varying(255) | 
 country        | character varying(255) | 
 num_businesses | integer                | 
 region_abbr    | text                   | 
Indexes:
    "postal_codes_pkey" PRIMARY KEY, btree (id)
    "idx_postal_codes_city_region_country" btree (lower(city::text), lower(region::text), country)
    "idx_postal_codes_lower_city" btree (lower(city::text))
    "idx_postal_codes_lower_region" btree (lower(region::text))
    "idx_region_city_postal_codes" btree (lower(region::text), lower(city::text))
    "index_postal_codes_on_code" btree (code)

Version and relevant tuning parameters (let me know if there are others I should look at):

server_version                      | 9.3.4
cpu_tuple_cost                      | 0.01
effective_cache_size                | 16GB
maintenance_work_mem                | 1GB
random_page_cost                    | 1.1
seq_page_cost                       | 1
shared_buffers                      | 8GB
work_mem                            | 1GB

I also have autovacuum turned on, and I re-analyzed businesses, categorizations, and postal_codes (though I don't think that matters).

3 Answers:

Answer 0 (score: 1):

The real answer: normalize. In the original postal codes table, a lot of {country, region, city} information is duplicated. Squeeze this "domain" out into a separate "cities" table. For example:

        -- temp schema for testing purposes.
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;

        -- A copy of the original table
-- CREATE TABLE tmp.postal_codes_orig AS (SELECT * FROM public.postal_codes);
        -- .. which I don't have ...
CREATE TABLE tmp.postal_codes_orig
        ( id SERIAL NOT NULL PRIMARY KEY
        , code character varying(255) UNIQUE
        , city character varying(255)
        , region character varying(255)
        , country character varying(255)
        , num_businesses integer
        , region_abbr text
        );

        -- some data to test it ...
INSERT INTO tmp.postal_codes_orig ( code , city , region , country , num_businesses , region_abbr ) VALUES
 ( '3500' , 'Utrecht' , 'Utrecht', 'Nederland', 1000, 'Ut' )
,( '3501' , 'Utrecht' , 'Utrecht', 'Nederland', 1001, 'UT' )
,( '3502' , 'Utrecht' , 'Utrecht', 'Nederland', 1002, 'Utr.' )
,( '3503' , 'Utrecht' , 'Utrecht', 'Nederland', 1003, 'Utr' )
,( '3504' , 'Utrecht' , 'Utrecht', 'Nederland', 1004, 'Ut.' )
        ;

        -- normalisation: squeeze out "city" domain
CREATE TABLE tmp.cities
        ( id SERIAL NOT NULL PRIMARY KEY
        , city character varying(255)
        , region character varying(255)  -- could be normalised out ...
        , country character varying(255) -- could be normalised out ...
        , region_abbr character varying(255)
        , UNIQUE (city,region,country)
        );

    -- table with all original postal codes, but referring to the cities table instead of duplicating it
CREATE TABLE tmp.postal_codes_cities
        ( id SERIAL NOT NULL PRIMARY KEY
        , code character varying(255) UNIQUE
        , city_id INTEGER NOT NULL REFERENCES tmp.cities(id)
        , num_businesses integer NOT NULL DEFAULT 0 -- this still depends on postal code, not on city
        );
        -- extract the unique cities domain
INSERT INTO tmp.cities(city,region,country,region_abbr)
SELECT city, region, country -- GROUP BY already makes these unique; DISTINCT would be redundant
        , MIN(region_abbr)
FROM tmp.postal_codes_orig
GROUP BY city,region,country
        ;

CREATE INDEX ON tmp.cities (lower(city::text), lower(region::text), country);
CREATE INDEX ON tmp.cities (lower(city::text));
CREATE INDEX ON tmp.cities (lower(region::text));
CREATE INDEX ON tmp.cities (lower(region::text), lower(city::text));

        -- populate the new postal codes table, retaining the 'stable' ids
INSERT INTO tmp.postal_codes_cities(id, code, city_id, num_businesses)
SELECT pc.id,pc.code, ci.id, pc.num_businesses
FROM tmp.postal_codes_orig pc
JOIN tmp.cities ci ON ci.city = pc.city
        AND pc.region = ci.region
        AND pc.country = ci.country
        ;

        -- and don't forget to advance the sequence
SELECT setval('postal_codes_cities_id_seq', MAX(pc.id)) FROM tmp.postal_codes_cities pc ;


        -- convenience view mimicking the original table
CREATE VIEW tmp.postal_codes AS
SELECT pc.id AS id
        , ci.city AS city
        , ci.region AS region
        , ci.country AS country
        , pc.num_businesses AS num_businesses
        , ci.region_abbr AS region_abbr
FROM tmp.postal_codes_cities pc
JOIN tmp.cities ci ON ci.id = pc.city_id
        ;

SELECT * FROM tmp.postal_codes;

Of course, the foreign keys in the other tables will have to be adjusted to point at the new postal_codes_cities.id. (Or at "code", which looks like the natural key to me.)

BTW: with the normalized schema, you don't even need the silly indexes on lower(region) and lower(city), because every name is stored only once, so you can force it into a canonical form.
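
For illustration, here is how the original count query might look against the normalized schema (a sketch only — the table and column names follow the example above, and the businesses foreign key is assumed to have been repointed at postal_codes_cities):

```sql
-- Sketch: the problem query rewritten against the normalized schema.
-- The city/region filter now touches the small cities table exactly once.
SELECT COUNT(*)
FROM businesses bu
JOIN categorizations ca     ON ca.business_id = bu.id
JOIN postal_codes_cities pc ON bu.postal_code_id = pc.id
JOIN cities ci              ON ci.id = pc.city_id
WHERE ca.category_id IN (958, 968, 936)
  AND lower(ci.city) = 'new york'
  AND lower(ci.region) = 'new york';
```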

Answer 1 (score: 0):

The planner doesn't know that every place whose city is New York is also in the New York region. It treats the two conditions as independent and multiplies their selectivities, which leads it to a bad estimate.
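
You can see the effect in the second plan above (rows=1 estimated vs. 150 actual on the postal_codes index scan). To observe the independent-selectivity assumption in isolation:

```sql
-- Sketch: compare estimated vs. actual rows for the correlated predicates.
-- Under the independence assumption, est(city AND region) ~ est(city) * est(region),
-- which collapses toward 1 even though 150 rows actually match both.
EXPLAIN ANALYZE
SELECT id
FROM postal_codes
WHERE lower(city) = 'new york'
  AND lower(region) = 'new york';
```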

Answer 2 (score: 0):

The simplest way to keep the NY/NY branch of the query from being mis-estimated is to put it into a CTE (CTEs are not broken up by the optimizer):

-- EXPLAIN ANALYZE
WITH ny AS (
        SELECT pc.id 
        FROM postal_codes pc 
        WHERE lower(pc.city) = 'new york' AND lower(pc.region) = 'new york'
        )
SELECT COUNT(*) FROM businesses bu
JOIN categorizations ca ON ca.business_id = bu.id
JOIN ny ON bu.postal_code_id = ny.id
WHERE ca.category_id IN (958,968,936)
        ;
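
As an aside (my own note, not part of the answer above), a subquery with OFFSET 0 is another commonly used PostgreSQL optimization fence: it keeps the subquery from being flattened into the outer query, so the NY/NY filter is planned on its own, much like the CTE:

```sql
-- Sketch (hedged alternative to the CTE): OFFSET 0 prevents subquery pull-up,
-- so the city/region predicates are evaluated as their own plan node.
SELECT COUNT(*)
FROM businesses bu
JOIN (SELECT id
      FROM postal_codes
      WHERE lower(city) = 'new york'
        AND lower(region) = 'new york'
      OFFSET 0) ny ON bu.postal_code_id = ny.id
JOIN categorizations ca ON ca.business_id = bu.id
WHERE ca.category_id IN (958, 968, 936);
```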