In PostgreSQL

Time: 2016-07-14 18:40:03

Tags: sql postgresql window-functions postgresql-performance

I have a product table with about 17,000,000 records.

CREATE TABLE vendor_prices (
  id serial PRIMARY KEY,
  vendor integer NOT NULL,
  sku character varying(25) NOT NULL,
  category_name character varying(100) NOT NULL,
  price numeric(8,5) NOT NULL,
  effective_date timestamp without time zone,
  expiration_date timestamp without time zone DEFAULT (now() + '1 year'::interval)
);

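Most of the records are redundant, and I maintain a separate table that contains only the relevant records. I use this answer on dba.stackexchange to delete the redundant records efficiently.

Here is the core of the query (using SELECT instead of DELETE):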

SELECT *
FROM  (
   -- del is true when a row repeats the previous row's price
   -- and is not the last row for its sku
   SELECT id,
          price = lag(price) OVER w
          AND (lead(id) OVER w) IS NOT NULL AS del
   FROM   vendor_prices
   WHERE  vendor = 516
   WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
   ) d
WHERE  NOT d.del;
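
To make the flag logic concrete, here is a minimal sketch on made-up rows. The `sample` CTE and its values are hypothetical, not data from the real table. It also shows a side effect of turning the DELETE into a SELECT: `del` is NULL for the first row of any sku that has more than one row (because `lag(price)` is NULL there), and `WHERE NOT d.del` drops such rows, whereas `del IS NOT TRUE` keeps them.

```sql
-- Hypothetical sample data; only the window logic mirrors the query above.
WITH sample(id, sku, price, effective_date) AS (
   VALUES (1, 'A', 1.00, DATE '2016-01-01')
        , (2, 'A', 1.00, DATE '2016-02-01')
        , (3, 'A', 1.00, DATE '2016-03-01')
        , (4, 'A', 2.00, DATE '2016-04-01')
        , (5, 'B', 5.00, DATE '2016-01-01')
   )
SELECT id, sku, price, del
     , NOT del         AS kept_by_not_del      -- NULL for id 1, so WHERE NOT del drops it
     , del IS NOT TRUE AS kept_by_is_not_true  -- true for id 1
FROM  (
   SELECT id, sku, price,
          price = lag(price) OVER w
          AND (lead(id) OVER w) IS NOT NULL AS del
   FROM   sample
   WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
   ) d
ORDER  BY id;
-- del per id: 1 -> NULL, 2 -> true, 3 -> true, 4 -> false, 5 -> false
```

Whether the first row per sku should count as "not redundant" depends on what the separate table is meant to hold; the sketch only shows how the two filters differ.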

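The problem I am running into is that execution takes too long, especially for vendors with a large number of records. The specific vendor in this query (WHERE vendor = 516) has about 3M rows, of which only about 80K are not redundant. What can I do to improve this query?

Here is the result of EXPLAIN ANALYZE: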

Aggregate  (cost=987648.74..987648.75 rows=1 width=0) (actual time=38220.825..38220.825 rows=1 loops=1)
  ->  Subquery Scan on d  (cost=862040.12..983596.85 rows=1620756 width=0) (actual time=31758.342..38211.262 rows=84245 loops=1)
        Filter: (NOT d.del)
        Rows Removed by Filter: 3094780
        ->  WindowAgg  (cost=862040.12..951181.72 rows=3241513 width=25) (actual time=31758.220..37929.024 rows=3179025 loops=1)
              ->  Sort  (cost=862040.12..870143.90 rows=3241513 width=25) (actual time=31758.196..34952.249 rows=3179025 loops=1)
                    Sort Key: vendor_prices.sku, vendor_prices.effective_date, vendor_prices.id
                    Sort Method: external merge  Disk: 123448kB
                    ->  Bitmap Heap Scan on vendor_prices  (cost=60790.16..356386.08 rows=3241513 width=25) (actual time=350.911..1512.974 rows=3179025 loops=1)
                          Recheck Cond: (vendor = 516)
                          Heap Blocks: exact=47546
                          ->  Bitmap Index Scan on idx_vendor_number  (cost=0.00..59979.79 rows=3241513 width=0) (actual time=336.936..336.936 rows=3179025 loops=1)
                                Index Cond: (vendor = 516)
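
One thing the plan above shows is that the sort spills to disk (`Sort Method: external merge  Disk: 123448kB`) and that the Sort node accounts for most of the runtime. A possible experiment, sketched below under assumptions, is to raise `work_mem` for a single transaction and compare plans. The `512MB` value is a guess rather than a measured requirement (an in-memory sort usually needs more memory than the on-disk size reported), and the `count(*)` wrapper is inferred from the Aggregate node at the top of the plan.

```sql
BEGIN;

-- Assumption: 512MB is a trial value; the setting only lasts until ROLLBACK.
SET LOCAL work_mem = '512MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM  (
   SELECT id,
          price = lag(price) OVER w
          AND (lead(id) OVER w) IS NOT NULL AS del
   FROM   vendor_prices
   WHERE  vendor = 516
   WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
   ) d
WHERE  NOT d.del;

ROLLBACK;
```

If the Sort node then reports `Sort Method: quicksort  Memory: ...` instead of `external merge`, the disk spill was at least part of the cost. `SET LOCAL` keeps the change from leaking into other sessions or later statements.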

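PS. I do have the multicolumn index that @Erwin suggested:

  • A [multicolumn index] on (vendor, sku, effective_date, id) would be perfect for this - in this particular order.

But the query is using idx_vendor_number, which is on the vendor column only, as you can see in the EXPLAIN ANALYZE output.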
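
To check whether the multicolumn index could serve this query at all, one diagnostic sketch is to discourage the plan node types seen in the EXPLAIN output for a single transaction and look at what the planner falls back to. The index name below is hypothetical; the column list and order are taken from the quoted suggestion. The `enable_*` settings are planner switches intended for testing, not a permanent fix.

```sql
-- Hypothetical index name; columns and their order follow the quoted suggestion.
CREATE INDEX IF NOT EXISTS vendor_prices_vendor_sku_date_id_idx
    ON vendor_prices (vendor, sku, effective_date, id);

-- Refresh planner statistics so the cost estimates reflect the current data.
ANALYZE vendor_prices;

BEGIN;

-- Diagnostic only: steer the planner away from the bitmap scan and the explicit sort.
SET LOCAL enable_bitmapscan = off;
SET LOCAL enable_sort = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id,
       price = lag(price) OVER w
       AND (lead(id) OVER w) IS NOT NULL AS del
FROM   vendor_prices
WHERE  vendor = 516
WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id);

ROLLBACK;
```

If the resulting plan uses an ordered index scan on the multicolumn index without a Sort node and still runs slower, the planner's original choice (bitmap scan plus sort) was probably reasonable for 3M rows, since an ordered index scan visits heap pages in a less sequential pattern.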

0 answers:

No answers