Question

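There are a lot of questions on this topic, but all of them seem more complicated than the case I'm looking at, and their answers don't seem to apply.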

OHDSI=> \d record_counts
               Table "results2.record_counts"
         Column         |         Type          | Modifiers
------------------------+-----------------------+-----------
 concept_id             | integer               |
 schema                 | text                  |
 table_name             | text                  |
 column_name            | text                  |
 column_type            | text                  |
 descendant_concept_ids | bigint                |
 rc                     | numeric               |
 drc                    | numeric               |
 domain_id              | character varying(20) |
 vocabulary_id          | character varying(20) |
 concept_class_id       | character varying(20) |
 standard_concept       | character varying(1)  |
Indexes:
    "rc_dom" btree (domain_id, concept_id)
    "rcdom" btree (domain_id)
    "rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept)

The table has 3,133,778 records, so Postgres shouldn't be ignoring the index on the grounds that the table is small.

I filter on domain_id, which is indexed, and the index is ignored:

OHDSI=> explain select * from record_counts where domain_id = 'Drug';
                               QUERY PLAN
------------------------------------------------------------------------
 Seq Scan on record_counts  (cost=0.00..76744.81 rows=2079187 width=87)
   Filter: ((domain_id)::text = 'Drug'::text)

I turn off seqscan and:

OHDSI=> set enable_seqscan=false;
SET
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on record_counts  (cost=42042.13..105605.97 rows=2079187 width=87)
   Recheck Cond: ((domain_id)::text = 'Drug'::text)
   ->  Bitmap Index Scan on rcdom  (cost=0.00..41522.33 rows=2079187 width=0)
         Index Cond: ((domain_id)::text = 'Drug'::text)

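Indeed, the plan says that using the index is more expensive (an estimated total cost of 105605.97 versus 76744.81 for the sequential scan), but why? If the index lets Postgres process far fewer records, shouldn't using it be quicker?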

Answer 1

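OK, it looks like Postgres knew what it was doing. The particular value of the indexed column I was filtering on ('Drug') happens to account for 66% of the rows in the table. So yes, the filter makes the row set noticeably smaller, but because those rows are scattered across pages, Postgres would still have to read most of the table, and the index doesn't let it retrieve them any faster.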

OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc;
      domain_id      |  rows   |   pct
---------------------+---------+---------
 Drug                | 2074991 | 66.2137
 Condition           |  466882 | 14.8984
 Observation         |  217807 |  6.9503
 Procedure           |  165800 |  5.2907
 Measurement         |  127239 |  4.0602
 Device              |   29410 |  0.9385
 Spec Anatomic Site  |   28783 |  0.9185
 Meas Value          |   10415 |  0.3323
 Unit                |    2350 |  0.0750
 Type Concept        |    2170 |  0.0692
 Provider Specialty  |    1957 |  0.0624
 Specimen            |    1767 |  0.0564
 Metadata            |    1689 |  0.0539
 Revenue Code        |     538 |  0.0172
 Place of Service    |     480 |  0.0153
 Race                |     467 |  0.0149
 Relationship        |     242 |  0.0077
 Condition/Obs       |     182 |  0.0058
 Currency            |     180 |  0.0057
 Condition/Meas      |     115 |  0.0037
 Route               |      81 |  0.0026
 Obs/Procedure       |      78 |  0.0025
 Condition/Device    |      52 |  0.0017
 Condition/Procedure |      25 |  0.0008
 Meas/Procedure      |      25 |  0.0008
 Gender              |      19 |  0.0006
 Device/Procedure    |       9 |  0.0003
 Meas Value Operator |       9 |  0.0003
 Visit               |       8 |  0.0003
 Drug/Procedure      |       3 |  0.0001
 Spec Disease Status |       3 |  0.0001
 Ethnicity           |       2 |  0.0001
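
The planner does not run a count like the one above. It works from the column statistics that ANALYZE collects. As a rough sketch, assuming the table has been analyzed recently, the standard pg_stats view shows what the planner believes about domain_id (the schema name results2 is taken from the \d output in the question):

```sql
-- Planner statistics for domain_id, as gathered by the last ANALYZE.
SELECT n_distinct,
       most_common_vals,   -- most frequent values, e.g. 'Drug'
       most_common_freqs,  -- estimated fraction of rows for each of those values
       correlation         -- how closely physical row order follows this column
FROM pg_stats
WHERE schemaname = 'results2'
  AND tablename  = 'record_counts'
  AND attname    = 'domain_id';
```

If the frequency listed for 'Drug' is close to 0.66, the planner expects roughly two thirds of the rows to match, which is consistent with the rows=2079187 estimate in the plans above.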

When I use any other value in the where clause (including 'Condition', which has 15% of the rows), Postgres does use the index.
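
The costs shown by plain EXPLAIN are only estimates. One way to check whether the planner's choice actually pays off, sketched here without having been run against this data, is to time both plans with EXPLAIN (ANALYZE, BUFFERS). Note that this really executes the query. SET LOCAL keeps the setting change confined to the transaction.

```sql
-- Plan the planner picks on its own, with real timings and buffer counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM record_counts WHERE domain_id = 'Drug';

-- Same query with sequential scans discouraged, only inside this transaction.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM record_counts WHERE domain_id = 'Drug';
ROLLBACK;
```

Comparing the actual times and the buffer counts of the two runs shows whether the bitmap scan would really have been slower, as the cost estimates suggest.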

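(Somewhat surprisingly, even after I cluster the table on the domain_id index, it still doesn't use the index when I filter on 'Drug'. But the performance boost from skipping only 34% of the rows doesn't seem worth pursuing further.)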
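
If the clustering experiment is repeated, one thing worth ruling out is stale statistics. The post does not say whether ANALYZE was run after CLUSTER, so this is only an assumption about a possible cause. The planner learns about the new physical ordering through the correlation statistic. A minimal sketch, using the rcdom index from the question:

```sql
-- Rewrite the table in rcdom index order (this takes an exclusive lock on the table).
CLUSTER record_counts USING rcdom;

-- Refresh planner statistics so the new physical ordering is visible.
ANALYZE record_counts;

-- A correlation near 1 or -1 means the column closely follows physical row order.
SELECT correlation
FROM pg_stats
WHERE schemaname = 'results2'
  AND tablename  = 'record_counts'
  AND attname    = 'domain_id';
```

Even with a high correlation, the planner may still prefer the sequential scan for a value that matches two thirds of the table.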

Postgres not using index
