I have a products table with about 17,000,000 records.
CREATE TABLE vendor_prices (
    id              serial PRIMARY KEY,
    vendor          integer NOT NULL,
    sku             character varying(25) NOT NULL,
    category_name   character varying(100) NOT NULL,
    price           numeric(8,5) NOT NULL,
    effective_date  timestamp without time zone,
    expiration_date timestamp without time zone DEFAULT (now() + '1 year'::interval)
);
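Most of the records are redundant, so I keep a separate table that contains only the relevant records. I am using this answer on dba.stackexchange to delete the duplicate records efficiently.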
Here is the core of the query (using SELECT instead of DELETE):
SELECT *
FROM  (
   SELECT id,
          price = lag(price) OVER w
          AND (lead(id) OVER w) IS NOT NULL AS del
   FROM   vendor_prices
   WHERE  vendor = 516
   WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
   ) d
WHERE  NOT d.del;
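To make it clearer what the del flag marks, here is a minimal sketch that runs the inner query on a handful of made-up rows. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption for portability; it needs a SQLite build with window function support), and the sample rows and the flag_rows helper are hypothetical, not part of my real data.

```python
import sqlite3

# Hypothetical sample rows: (id, vendor, sku, price, effective_date).
ROWS = [
    (1, 516, "A", 1.00, "2015-01-01"),  # first row for sku A
    (2, 516, "A", 1.00, "2015-02-01"),  # same price as previous row, not the last row
    (3, 516, "A", 2.00, "2015-03-01"),  # price changed
    (4, 516, "A", 2.00, "2015-04-01"),  # same price, but last row for sku A
    (5, 516, "B", 5.00, "2015-01-01"),  # only row for sku B
]

# Same window logic as the inner query above: a row is flagged when its
# price equals the previous price for the sku and it is not the last row.
QUERY = """
SELECT id,
       price = lag(price) OVER w
       AND (lead(id) OVER w) IS NOT NULL AS del
FROM   vendor_prices
WHERE  vendor = 516
WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
ORDER  BY id
"""


def flag_rows():
    """Load the sample rows into an in-memory table and return (id, del) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE vendor_prices ("
        "id INTEGER PRIMARY KEY, vendor INTEGER, sku TEXT, "
        "price REAL, effective_date TEXT)"
    )
    conn.executemany("INSERT INTO vendor_prices VALUES (?, ?, ?, ?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


flags = flag_rows()
for row_id, del_flag in flags:
    print(row_id, del_flag)
```

In this sketch only id 2 is flagged as redundant. The first row of a sku that has more than one row (id 1 here) gets NULL rather than false, because lag() has nothing to compare against. A DELETE filtering on d.del leaves such rows alone, but a SELECT filtering on NOT d.del would also skip them, so the row count from the SELECT version may be somewhat lower than the number of rows the DELETE would actually keep.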
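The problem is that execution takes far too long, especially for vendors with a large number of records. The vendor in this query (WHERE vendor = 516) has about 3M rows, of which only about 80K are not redundant. What can I do to improve this query?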
Here is the EXPLAIN ANALYZE output:
Aggregate  (cost=987648.74..987648.75 rows=1 width=0) (actual time=38220.825..38220.825 rows=1 loops=1)
  ->  Subquery Scan on d  (cost=862040.12..983596.85 rows=1620756 width=0) (actual time=31758.342..38211.262 rows=84245 loops=1)
        Filter: (NOT d.del)
        Rows Removed by Filter: 3094780
        ->  WindowAgg  (cost=862040.12..951181.72 rows=3241513 width=25) (actual time=31758.220..37929.024 rows=3179025 loops=1)
              ->  Sort  (cost=862040.12..870143.90 rows=3241513 width=25) (actual time=31758.196..34952.249 rows=3179025 loops=1)
                    Sort Key: vendor_prices.sku, vendor_prices.effective_date, vendor_prices.id
                    Sort Method: external merge  Disk: 123448kB
                    ->  Bitmap Heap Scan on vendor_prices  (cost=60790.16..356386.08 rows=3241513 width=25) (actual time=350.911..1512.974 rows=3179025 loops=1)
                          Recheck Cond: (vendor = 516)
                          Heap Blocks: exact=47546
                          ->  Bitmap Index Scan on idx_vendor_number  (cost=0.00..59979.79 rows=3241513 width=0) (actual time=336.936..336.936 rows=3179025 loops=1)
                                Index Cond: (vendor = 516)
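PS. I do have the multicolumn index that @Erwin suggested:
"A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order."
But the query is using idx_vendor_number instead, which, as you can see in the EXPLAIN ANALYZE output, is only on the vendor column.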