Question

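We run migrations frequently, and a recent, seemingly routine migration caused Postgres to choose a bad query plan, which led to extremely slow queries. The queries were slow enough that they eventually took our site down.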

The migration removed a NOT NULL constraint from an existing table. In this example, it removed the NOT NULL constraint on items.store_id.
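The migration itself is not shown, so the following is only a sketch of the DDL such a change would typically produce, assuming the table lives in the public schema as the index definitions suggest. The second statement is one way to confirm that the column is nullable afterwards.

```sql
-- Assumed DDL: the actual migration was not included in the question.
ALTER TABLE public.items ALTER COLUMN store_id DROP NOT NULL;

-- Confirm that store_id now accepts NULLs (is_nullable should read 'YES').
SELECT column_name, is_nullable
  FROM information_schema.columns
 WHERE table_schema = 'public'
   AND table_name   = 'items'
   AND column_name  = 'store_id';
```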

Indexes on the items table:

CREATE INDEX index_items_store_id ON public.items USING btree (store_id)
CREATE INDEX index_items_on_department_id ON public.items USING btree (department_id)
CREATE INDEX index_items_on_deleted_at ON public.items USING btree (deleted_at)
CREATE UNIQUE INDEX items_pkey ON public.items USING btree (id)

Example query:

SELECT "stores".*
  FROM "stores"
 WHERE "stores"."organization_id" = 1337
   AND "stores"."store_status_id" = 1
   AND (EXISTS (SELECT "items".*
                  FROM "items"
                 INNER JOIN "departments"
                    ON "departments"."id" = "items"."department_id"
                 WHERE "items"."deleted_at" IS NULL
                   AND ("items"."department_id" IS NOT NULL)
                   AND (items.store_id = stores.id)
                   AND "items"."job_application_status_id" = 3
                   AND "departments"."department_status_id" = 3));

Bad query plan:

Nested Loop  (cost=216890.17..217379.24 rows=192 width=1236)
   ->  HashAggregate  (cost=216889.74..216891.74 rows=200 width=4)
         Group Key: items.store_id
         ->  Merge Join  (cost=2.13..216769.74 rows=48000 width=4)
               Merge Cond: (departments.id = items.department_id)
               ->  Index Scan using departments_pkey on departments  (cost=0.41..10394.19 rows=8925 width=4)
                     Filter: (department_status_id = 3)
               ->  Index Scan using index_items_on_department_id on items  (cost=0.55..309417.94 rows=8 8783 width=8)
                     Index Cond: (department_id IS NOT NULL)
                     Filter: ((deleted_at IS NULL) AND (job_application_status_id = 3))
   ->  Index Scan using stores_pkey on stores  (cost=0.42..2.43 rows=1 width=1236)
         Index Cond: (id = items.store_id)
         Filter: ((organization_id = 1337) AND (store_status_id = 1))

Good query plan:

Nested Loop Semi Join  (cost=1.26..4566.90 rows=21 width=1236)
   ->  Index Scan using index_stores_on_organization_id on stores  (cost=0.42..2236.83 rows=385 width=1236)
         Index Cond: (organization_id = 1337)
         Filter: (store_status_id = 1)
   ->  Nested Loop  (cost=0.84..6.04 rows=1 width=4)
         ->  Index Scan using index_gh_job_app_store_id on items  (cost=0.43..5.46 rows=1 width=8)
               Index Cond: (store_id = stores.id)
               Filter: ((deleted_at IS NULL) AND (department_id IS NOT NULL) AND (job_application_status_id = 3))
         ->  Index Scan using departments_pkey on departments  (cost=0.41..0.57 rows=1 width=4)
               Index Cond: (id = items.department_id)
               Filter: (department_status_id = 3)
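Both plans above show only planner estimates (cost and rows), as a plain EXPLAIN produces. A sketch of how the estimated and actual row counts could be compared for the same query follows. EXPLAIN ANALYZE really executes the statement, so the 30-second statement_timeout here is an illustrative safeguard against the slow plan, not a value from the original post.

```sql
BEGIN;
-- Illustrative limit so a bad plan cannot run unbounded; tune as needed.
SET LOCAL statement_timeout = '30s';

-- ANALYZE adds actual row counts and timings; BUFFERS adds I/O counters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "stores".*
  FROM "stores"
 WHERE "stores"."organization_id" = 1337
   AND "stores"."store_status_id" = 1
   AND (EXISTS (SELECT "items".*
                  FROM "items"
                 INNER JOIN "departments"
                    ON "departments"."id" = "items"."department_id"
                 WHERE "items"."deleted_at" IS NULL
                   AND ("items"."department_id" IS NOT NULL)
                   AND (items.store_id = stores.id)
                   AND "items"."job_application_status_id" = 3
                   AND "departments"."department_status_id" = 3));

-- Nothing is modified, but ROLLBACK also clears the state if the timeout fired.
ROLLBACK;
```

A large gap between estimated and actual rows on a node usually points at the statistics the planner had for that table.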

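After we ran ANALYZE on the items table, Postgres chose the good query plan and everything returned to normal. We have tried to reproduce this with the same migration locally and in our staging environments, but could not. We have also run this kind of migration many times before without ever hitting this problem. We suspect it is related to the volume of queries the table receives at any given moment, which would explain why it is hard to reproduce.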
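The fix described here can be sketched as follows, together with two catalog queries that show what the planner knows about the table. The schema name public is taken from the index definitions; which columns matter is an assumption. The rows=200 estimate on the HashAggregate in the bad plan happens to equal the planner's default guess for the number of distinct values when no column statistics are available, so checking pg_stats for store_id right after a migration may be informative, though the question does not confirm that missing statistics were the cause.

```sql
-- Refresh planner statistics for the table the migration touched.
ANALYZE VERBOSE public.items;

-- When were statistics last gathered, manually or by autovacuum,
-- and how many rows have changed since then?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
  FROM pg_stat_user_tables
 WHERE schemaname = 'public'
   AND relname    = 'items';

-- What the planner currently believes about store_id.
-- No row returned means there are no statistics for the column.
SELECT attname, null_frac, n_distinct
  FROM pg_stats
 WHERE schemaname = 'public'
   AND tablename  = 'items'
   AND attname    = 'store_id';
```

Running the ANALYZE as the last step of the migration itself is one possible way to avoid depending on autovacuum timing under load.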

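However, we do not have much expertise in how Postgres chooses a query plan, or in how to avoid this situation in the future. If anyone has ideas about why this happened or how to prevent it, we would greatly appreciate them.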

Postgres 9.6 chooses a bad query plan after a migration on a high-load database

0 answers: