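There are a lot of questions on this topic, but all of them seem to involve more complex cases than the one I'm looking at, and their answers don't seem to apply.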
OHDSI=> \d record_counts
Table "results2.record_counts"
Column | Type | Modifiers
------------------------+-----------------------+-----------
concept_id | integer |
schema | text |
table_name | text |
column_name | text |
column_type | text |
descendant_concept_ids | bigint |
rc | numeric |
drc | numeric |
domain_id | character varying(20) |
vocabulary_id | character varying(20) |
concept_class_id | character varying(20) |
standard_concept | character varying(1) |
Indexes:
"rc_dom" btree (domain_id, concept_id)
"rcdom" btree (domain_id)
"rcdomvocsc" btree (domain_id, vocabulary_id, standard_concept)
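The table has 3,133,778 records, so Postgres shouldn't be ignoring the index on the grounds that the table is small.
I filter on domain_id, which is indexed, and the index is ignored: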
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
------------------------------------------------------------------------
Seq Scan on record_counts (cost=0.00..76744.81 rows=2079187 width=87)
Filter: ((domain_id)::text = 'Drug'::text)
I turn off seqscan and get:
OHDSI=> set enable_seqscan=false;
SET
OHDSI=> explain select * from record_counts where domain_id = 'Drug';
QUERY PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on record_counts (cost=42042.13..105605.97 rows=2079187 width=87)
Recheck Cond: ((domain_id)::text = 'Drug'::text)
-> Bitmap Index Scan on rcdom (cost=0.00..41522.33 rows=2079187 width=0)
Index Cond: ((domain_id)::text = 'Drug'::text)
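Indeed, the plan says that using the index will cost more than not using it (105605.97 versus 76744.81), but why? If the index lets Postgres handle far fewer records, shouldn't using it be faster?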
Answer 0 (score: 2)
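Ok, it looks like Postgres knows what it's doing. The particular value I was filtering on ('Drug') happens to account for 66% of the rows in the table. So yes, the filter does make the row set significantly smaller, but because those rows are scattered across pages, the index doesn't let Postgres retrieve them any faster.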
OHDSI=> select domain_id, count(*) as rows, round((100 * count(*)::float / 3133778.0)::numeric,4) pct from record_counts group by 1 order by 2 desc;
domain_id | rows | pct
---------------------+---------+---------
Drug | 2074991 | 66.2137
Condition | 466882 | 14.8984
Observation | 217807 | 6.9503
Procedure | 165800 | 5.2907
Measurement | 127239 | 4.0602
Device | 29410 | 0.9385
Spec Anatomic Site | 28783 | 0.9185
Meas Value | 10415 | 0.3323
Unit | 2350 | 0.0750
Type Concept | 2170 | 0.0692
Provider Specialty | 1957 | 0.0624
Specimen | 1767 | 0.0564
Metadata | 1689 | 0.0539
Revenue Code | 538 | 0.0172
Place of Service | 480 | 0.0153
Race | 467 | 0.0149
Relationship | 242 | 0.0077
Condition/Obs | 182 | 0.0058
Currency | 180 | 0.0057
Condition/Meas | 115 | 0.0037
Route | 81 | 0.0026
Obs/Procedure | 78 | 0.0025
Condition/Device | 52 | 0.0017
Condition/Procedure | 25 | 0.0008
Meas/Procedure | 25 | 0.0008
Gender | 19 | 0.0006
Device/Procedure | 9 | 0.0003
Meas Value Operator | 9 | 0.0003
Visit | 8 | 0.0003
Drug/Procedure | 3 | 0.0001
Spec Disease Status | 3 | 0.0001
Ethnicity | 2 | 0.0001
When I use any other value in the where clause (including 'Condition', which has 15% of the rows), Postgres does use the index.
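The planner does not run a count like the one above; it relies on the statistics that ANALYZE stores for each column. As a rough sketch, assuming the table has been analyzed and lives in the results2 schema shown in the \d output, those statistics can be inspected through the pg_stats view:

```sql
-- Planner statistics for results2.record_counts.domain_id, as gathered by ANALYZE.
-- most_common_freqs lists the estimated fraction of rows for each value in
-- most_common_vals; correlation describes how closely the physical row order
-- follows the column's sort order (values near 1 or -1 mean well ordered).
SELECT most_common_vals,
       most_common_freqs,
       correlation
FROM pg_stats
WHERE schemaname = 'results2'
  AND tablename  = 'record_counts'
  AND attname    = 'domain_id';
```

If 'Drug' shows up there with a frequency near 0.66, that would be consistent with the rows=2079187 estimate in the plans above, which is about 66% of the table.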
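(Somewhat surprisingly, even after I clustered the table on the domain_id index, Postgres still doesn't use the index when I filter on 'Drug', but the performance gain from filtering out 34% of the rows doesn't seem worth pursuing further.)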
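For anyone who wants to repeat the clustering experiment, here is a minimal sketch. It assumes rcdom is the index used for clustering (the text only says "the domain_id index") and that search_path resolves record_counts as in the session above. Note that CLUSTER takes an exclusive lock while it rewrites the table, and that the planner only learns about the new physical order once ANALYZE has been run afterwards.

```sql
-- Rewrite the table in rcdom (domain_id) order, then refresh statistics
-- so the planner sees the new physical ordering.
CLUSTER record_counts USING rcdom;
ANALYZE record_counts;

-- Measure actual run time and buffer usage for the plan the planner prefers...
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM record_counts WHERE domain_id = 'Drug';

-- ...and for the plan it falls back to when sequential scans are discouraged.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM record_counts WHERE domain_id = 'Drug';
RESET enable_seqscan;
```

Comparing the actual times of the two plans shows whether the planner's cost estimates point in the right direction on this data; I haven't recorded those numbers here.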