Question

数据库新手在这里。这是我的查询，我正在使用Postgres 9.3.5：

=# explain analyse SELECT SUM(actual_cost) as cost, SUM(total_items) 
   as num_items, processing_date FROM frontend_items 
   WHERE chemical_id='0501013B0' GROUP BY processing_date;

这是查询计划：

HashAggregate  (cost=1648624.91..1648624.92 rows=1 width=16) (actual time=12591.844..12591.848 rows=17 loops=1)
   ->  Bitmap Heap Scan on frontend_items  (cost=14520.24..1643821.35 rows=640474 width=16) (actual time=254.841..12317.746 rows=724242 loops=1)
         Recheck Cond: ((chemical_id)::text = '0501013B0'::text)
         ->  Bitmap Index Scan on frontend_items_chemical_id_varchar_pattern_ops_idx  (cost=0.00..14360.12 rows=640474 width=0) (actual time=209.538..209.538 rows=724242 loops=1)
               Index Cond: ((chemical_id)::text = '0501013B0'::text)
 Total runtime: 12592.499 ms

正如您所看到的，大部分时间都是Bitmap Heap Scan。有什么方法可以加快速度吗？

如果需要，我可以创建更多索引：我的数据几乎是只读的（每月更新一次）。

我猜我没有太多可以做的，因为我需要多个属性，除了支付足够的RAM来将整个数据库保存在内存中，但建议将非常感激。

我可以一次查找其中一个属性，如果这样可以加快速度。

注意：我在带有16GB内存和SSD的Macbook上运行它。我已将shared_buffers设置为4GB，将work_mem设置为40MB。我最终将使用具有32GB RAM和SSD的服务器。

更新：表架构如下：

     Column       |          Type           |                             Modifiers
-------------------+-------------------------+--------------------------------------------------------------------
 id                | integer                 | not null default nextval('frontend_items_id_seq'::regclass)
 presentation_code | character varying(15)   | not null
 presentation_name | character varying(1000) | not null
 total_items       | integer                 | not null
 net_cost          | double precision        | not null
 actual_cost       | double precision        | not null
 quantity          | double precision        | not null
 processing_date   | date                    | not null
 price_per_unit    | double precision        | not null
 chemical_id       | character varying(9)    | not null
 pct_id            | character varying(3)    | not null
 practice_id       | character varying(6)    | not null
 sha_id            | character varying(3)    | not null
Indexes:
    "frontend_items_pkey" PRIMARY KEY, btree (id)
    "frontend_items_45fff4c7" btree (sha_id)
    "frontend_items_4e2e609b" btree (pct_id)
    "frontend_items_528f368c" btree (processing_date)
    "frontend_items_6ea07fe3" btree (practice_id)
    "frontend_items_a69d813a" btree (chemical_id)
    "frontend_items_b9b2c7ab" btree (presentation_code)
    "frontend_items_chemical_id_varchar_pattern_ops_idx" btree (chemical_id varchar_pattern_ops)
    "frontend_items_pct_code_id_488a8bbfb2bddc6d_like" btree (pct_id varchar_pattern_ops)
    "frontend_items_practice_id_bbbafffdb2c2bf1_like" btree (practice_id varchar_pattern_ops)
    "frontend_items_presentation_code_69403ee04fda6522_like" btree (presentation_code varchar_pattern_ops)
    "frontend_items_presentation_code_varchar_pattern_ops_idx" btree (presentation_code varchar_pattern_ops)
Foreign-key constraints:
    "front_chemical_id_4619f68f65c49a8_fk_frontend_chemical_bnf_code" FOREIGN KEY (chemical_id) REFERENCES frontend_chemical(bnf_code) DEFERRABLE INITIALLY DEFERRED
    "frontend__practice_id_bbbafffdb2c2bf1_fk_frontend_practice_code" FOREIGN KEY (practice_id) REFERENCES frontend_practice(code) DEFERRABLE INITIALLY DEFERRED
    "frontend_items_pct_id_30c06df242c3d1ba_fk_frontend_pct_code" FOREIGN KEY (pct_id) REFERENCES frontend_pct(code) DEFERRABLE INITIALLY DEFERRED
    "frontend_items_sha_id_4fa0ca3c3b9b67f_fk_frontend_sha_code" FOREIGN KEY (sha_id) REFERENCES frontend_sha(code) DEFERRABLE INITIALLY DEFERRED

这是详细解释的输出：

# explain (verbose, buffers, analyse) SELECT SUM(actual_cost) as cost, SUM(total_items) as num_items, processing_date FROM frontend_items WHERE chemical_id='0501012G0' GROUP BY processing_date;



    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1415349.73..1415349.74 rows=1 width=16) (actual time=3048.551..3048.556 rows=17 loops=1)
   Output: sum(actual_cost), sum(total_items), processing_date
   Buffers: shared hit=141958 read=12725
   ->  Bitmap Heap Scan on public.frontend_items  (cost=11797.55..1411446.84 rows=520385 width=16) (actual time=213.889..2834.911 rows=524644 loops=1)
         Output: id, presentation_code, presentation_name, total_items, net_cost, actual_cost, quantity, processing_date, price_per_unit, chemical_id, pct_id, practice_id, sha_id
         Recheck Cond: ((frontend_items.chemical_id)::text = '0501012G0'::text)
         Buffers: shared hit=141958 read=12725
         ->  Bitmap Index Scan on frontend_items_chemical_id_varchar_pattern_ops_idx  (cost=0.00..11667.46 rows=520385 width=0) (actual time=172.574..172.574 rows=524644 loops=1)
               Index Cond: ((frontend_items.chemical_id)::text = '0501012G0'::text)
               Buffers: shared hit=2 read=2012
 Total runtime: 3049.177 ms

Answer 1

您有724242行，查询占用12592.499 ms。这是每行0.017387毫秒，即每秒57514行。你在抱怨什么？我认为您的查询速度很快。普通HDD通过索引扫描仅支持每秒65到200行的速率，尽管位图索引/堆扫描速度更快。我想你会发现PostgreSQL正在为你的情况使用最好的查询计划。

如果再次执行查询，它会变快吗？缓存会很热，因此重复执行可能会更快。如果它没有变得更快，那么更多的内存不太可能有所帮助。 PostgreSQL的数据页大小为8 KB，因此您最多可以访问724242 * 8 KB = 5.5 GB的数据，即数据应该适合您的RAM。

编辑：问题的编辑版本中提到的第二个查询显示每秒172000行的性能。因此，如果数据缓存在RAM中，则此类查询可能会变得更快。我会选择在RAM中拟合整个数据集的方法。 RAM很便宜，但开发时间很贵。

Answer 2

由于所选行是特定的，因此在这种情况下添加多列索引可能会有所帮助。可能会有助于使计划员轻描淡写使用索引扫描而不是位图扫描。

CREATE INDEX ON frontend_items(chemical_id, processing_date, total_items, actual_cost);

请注意，这是可能的，但在优化查询时值得尝试。

有没有办法加快这个Postgres位图堆扫描？

2 个答案: