How can I speed up a window query that does string prefix matching?

Date: 2013-05-21 22:00:19

Tags: postgresql

I have a table with roughly 7.5 million records, and I'm trying to build an autocomplete form on top of it, but performance is very poor.

The schema (irrelevant fields omitted) is as follows:

COMPANIES
---------
sid (integer primary key)
world_hq_sid (integer)
name (varchar(64))
marketing_alias (varchar(64))
address_country_code (char(4))
address_state (varchar(64))
sort_order (integer)
search_weight (integer)
annual_sales (integer)

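The inputs are an optional country code and state, plus the search term. I want the search term to match (case-insensitively) the beginning of either name or marketing_alias. I want the top ten results, with rows that match both country and state first, then rows that match only the country, then rows with no state/country match. Within each of those tiers, I want results ordered by sort_order.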

In addition, I want only one match per world_hq_sid. Finally, once I have the top match for each world_hq_sid, I want the final results ordered by search_weight.

I'm using a window query to implement the world_hq_sid part. Here is the query:

SELECT * FROM (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY world_hq_sid
               ORDER BY CASE WHEN address_country_code = 'US' AND address_state = 'CA' THEN 2
                             WHEN address_country_code = 'US' THEN 1 ELSE 0 END DESC,
                        sort_order ASC) AS r,
           companies.*
    FROM companies
    WHERE upper(name) LIKE upper('co%')
       OR upper(marketing_alias) LIKE upper('co%')
  ) x
  WHERE x.r = 1
  ORDER BY CASE WHEN address_country_code = 'US' AND address_state = 'CA' THEN 2
                WHEN address_country_code = 'US' THEN 1 ELSE 0 END DESC,
           search_weight ASC, annual_sales DESC
  LIMIT 10;

I have regular btree indexes on address_state, address_country_code, world_hq_sid, sort_order, and search_weight.

I have the following indexes on the name and marketing_alias fields:

CREATE INDEX companies_alias_pattern_upper_idx ON companies(upper(marketing_alias) varchar_pattern_ops);
CREATE INDEX companies_name_pattern_upper_idx ON companies(upper(name) varchar_pattern_ops);

Here is the EXPLAIN ANALYZE output when I pass CA as the state and 'co' as the search term:

Limit  (cost=676523.01..676523.03 rows=10 width=939) (actual time=18695.686..18695.687 rows=10 loops=1)
 ->  Sort  (cost=676523.01..676526.67 rows=1466 width=939) (actual time=18695.686..18695.687 rows=10 loops=1)
     Sort Key: x.search_weight, x.annual_sales
     Sort Method: top-N heapsort  Memory: 30kB
     ->  Subquery Scan on x  (cost=665492.58..676491.33 rows=1466 width=939) (actual time=18344.715..18546.830 rows=151527 loops=1)
           Filter: (x.r = 1)
           Rows Removed by Filter: 20672
           ->  WindowAgg  (cost=665492.58..672825.08 rows=293300 width=931) (actual time=18344.710..18511.625 rows=172199 loops=1)
                 ->  Sort  (cost=665492.58..666225.83 rows=293300 width=931) (actual time=18344.702..18359.145 rows=172199 loops=1)
                       Sort Key: companies.world_hq_sid, (CASE WHEN ((companies.address_state)::text = 'CA'::text) THEN 1 ELSE 0 END), companies.sort_order
                       Sort Method: quicksort  Memory: 108613kB
                       ->  Bitmap Heap Scan on companies  (cost=17236.64..518555.98 rows=293300 width=931) (actual time=1861.665..17999.806 rows=172199 loops=1)
                             Recheck Cond: ((upper((name)::text) ~~ 'CO%'::text) OR (upper((marketing_alias)::text) ~~ 'CO%'::text))
                             Filter: ((upper((name)::text) ~~ 'CO%'::text) OR (upper((marketing_alias)::text) ~~ 'CO%'::text))
                             ->  BitmapOr  (cost=17236.64..17236.64 rows=196219 width=0) (actual time=1829.061..1829.061 rows=0 loops=1)
                                   ->  Bitmap Index Scan on companies_name_pattern_upper_idx  (cost=0.00..8987.98 rows=97772 width=0) (actual time=971.331..971.331 rows=169390 loops=1)
                                         Index Cond: ((upper((name)::text) ~>=~ 'CO'::text) AND (upper((name)::text) ~<~ 'CP'::text))
                                   ->  Bitmap Index Scan on companies_alias_pattern_upper_idx  (cost=0.00..8102.02 rows=98447 width=0) (actual time=857.728..857.728 rows=170616 loops=1)
                                         Index Cond: ((upper((marketing_alias)::text) ~>=~ 'CO'::text) AND (upper((marketing_alias)::text) ~<~ 'CP'::text))

I have raised work_mem and shared_buffers to 100M.

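As you can see, this query returns after 18 seconds. Oddly, the timings are all over the map for different starting characters, from 400 ms (acceptable) to 30 seconds (very unacceptable). Postgres gurus, my question is: am I simply expecting too much of PostgreSQL to run a query like this quickly? Is there any way to speed it up?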

2 answers:

Answer 0: (score: 1)

select *
from (
    select distinct on (world_hq_sid)
        world_hq_sid,
        (address_country_code = 'US')::int + (address_state = 'CA')::int AS address_weight,
        sort_order,
        search_weight, annual_sales,
        sid, name, marketing_alias,
        address_country_code, address_state
    from companies
    where
        upper(name) LIKE upper('co%')
        OR upper(marketing_alias) LIKE upper('co%')
    order by 1, 2 desc, 3  -- world_hq_sid, address_weight desc, sort_order
) s
order by
    address_weight desc,
    search_weight,
    annual_sales desc
limit 10
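
The query above is posted without explanation. It relies on PostgreSQL's DISTINCT ON, which keeps only the first row of each world_hq_sid group in ORDER BY order, in place of the ROW_NUMBER() window plus the r = 1 filter. A minimal sketch of that behavior, using made-up rows (the VALUES list is hypothetical and not data from the question):

```sql
-- Hypothetical sample rows; not taken from the companies table.
SELECT DISTINCT ON (world_hq_sid)
       world_hq_sid, sid, sort_order
FROM (VALUES
        (1, 10, 2),
        (1, 11, 1),
        (2, 20, 5)
     ) AS t(world_hq_sid, sid, sort_order)
-- The leading ORDER BY expression must match the DISTINCT ON expression;
-- the remaining expressions decide which row of each group is kept.
ORDER BY world_hq_sid, sort_order;
```

For world_hq_sid 1 this keeps the row with sid 11 (the lowest sort_order), and for world_hq_sid 2 the single row with sid 20.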

Answer 1: (score: 0)

For autocomplete, you can use trigram search.

pg_trgm module

CREATE EXTENSION pg_trgm;
ALTER TABLE companies ADD COLUMN name_trgm TEXT NULL;
UPDATE companies SET name_trgm = UPPER(name);

CREATE INDEX companies_name_trgm_gin_idx ON companies USING GIN (name_trgm gin_trgm_ops);
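
The statements above stop at index creation. Below is a sketch of how the new column might be queried, assuming the name_trgm column and GIN index above exist and that the search term is upper-cased to match the stored values. Whether the planner actually picks the trigram index depends on the PostgreSQL version and the table statistics, and a two-character prefix yields very few trigrams, so it is worth checking with EXPLAIN ANALYZE. The second statement is an alternative under the same assumptions; its index name is hypothetical.

```sql
-- Assumes name_trgm and companies_name_trgm_gin_idx from the statements above.
EXPLAIN ANALYZE
SELECT sid, name
FROM companies
WHERE name_trgm LIKE upper('co%');

-- Alternative sketch: an expression index avoids the extra column, which
-- otherwise has to be kept in sync with name on every INSERT and UPDATE.
-- The index name is hypothetical.
CREATE INDEX companies_name_upper_trgm_idx
    ON companies USING GIN (upper(name) gin_trgm_ops);
```

With the expression index, the predicate would be written against upper(name) rather than name_trgm. marketing_alias would need the same treatment to cover both halves of the original OR condition.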