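Edit: This is not a question about space, but about the uniqueness of the index, which affects the query plan.
In terms of cardinality, which index scenario is higher: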
A
Table:
(
Col1 smallint,
Col2 smallint
)
, where
Range Col1 : 0 - 1000
Range Col2 : 0 - 1000
and a composite index on (Col1, Col2), always queried in that order.
B
Table:
(
Col1_2 int
)
, where
Range Col1_2 : 0 - 1000^2
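and a single index on (Col1_2), where the Col1 and Col2 components are combined for both storage and querying.
What I am basically asking is whether it is better, in terms of index usage, to combine (hash) several small numbers into one, or whether it makes no difference.
Answer 0 (score: 4)
The main differences between a composite index (an index on (a, b)) and an index on a hash function are:
With a composite index, PostgreSQL can make decisions based on the statistics it keeps for each column; and
With a composite index, you can query the index efficiently for column a alone. You cannot use it efficiently for column b alone, though.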
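By contrast, an index on (a::bigint << 32) + b, i.e. a 64-bit single-element index combining the values of a and b, can only be used when you have both a and b. The same is true of an index on some_hash_function(a,b).
Note that the parentheses matter: in PostgreSQL + binds tighter than <<, so the demo below, which writes a::bigint << 32 + b, actually indexes a::bigint << (32 + b), as its plan output shows. Distinct (a, b) pairs therefore collide, and that query matches 94 rows rather than 1.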
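The arithmetic behind such a combined key can be sketched in Python, where + also binds tighter than <<, so the grouping matches what the plan output further down shows. The helper names pack, unpack and pack_unparenthesized are invented for this illustration, and Python integers do not wrap at 64 bits, so this models only how the expression groups, not bigint overflow.

```python
from typing import Tuple

# Hypothetical helpers illustrating the combined-key arithmetic; not PostgreSQL code.

def pack(a: int, b: int) -> int:
    """Intended packing: a in the high 32 bits, b in the low 32 bits."""
    return (a << 32) + b

def unpack(key: int) -> Tuple[int, int]:
    """Recover (a, b) from a key built by pack(), assuming 0 <= b < 2**32."""
    return key >> 32, key & 0xFFFFFFFF

def pack_unparenthesized(a: int, b: int) -> int:
    """What `a << 32 + b` computes: the addition is applied before the shift."""
    return a << 32 + b

if __name__ == "__main__":
    # The parenthesized form round-trips and keeps distinct pairs distinct.
    print(unpack(pack(42, 3)))
    print(pack(42, 3) == pack(84, 2))
    # Without parentheses, 42 << 35 and 84 << 34 are the same number.
    print(pack_unparenthesized(42, 3) == pack_unparenthesized(84, 2))
```

With the parenthesized form, a key can be split back into its two components, which is only possible because every (a, b) pair maps to a different value.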
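An index on a hash of the values can have one big advantage: it makes the index smaller. The price is reduced selectivity and the need to recheck the real condition, with something like: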
WHERE some_hash_function(a,b) = some_hash_function(42,3) AND (a = 42 AND b = 3)
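Why the recheck is needed can be seen in a small Python sketch. The some_hash_function below is a deliberately weak stand-in invented for this example (the answer does not define one), and a dictionary plays the role of the index.

```python
from collections import defaultdict

def some_hash_function(a: int, b: int) -> int:
    """Toy stand-in hash with only 1024 possible values, so collisions are certain."""
    return (a * 31 + b) % 1024

# The "table": 100 x 100 distinct (a, b) pairs.
rows = [(a, b) for a in range(1, 101) for b in range(1, 101)]

# The "index": keyed by the hash value only, not by a and b themselves.
index = defaultdict(list)
for row in rows:
    index[some_hash_function(*row)].append(row)

def lookup(a: int, b: int):
    """Probe the index by hash, then recheck the real condition on each candidate."""
    candidates = index[some_hash_function(a, b)]
    matches = [row for row in candidates if row == (a, b)]
    return candidates, matches

if __name__ == "__main__":
    candidates, matches = lookup(42, 3)
    print(len(candidates), matches)
```

The index probe alone returns every row that shares the hash value; only the extra equality test on a and b narrows that down to the row actually asked for.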
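There is one possibility you have not considered: two separate indexes on a and b. PostgreSQL can combine them in a bitmap index scan or use them individually, whichever suits the query better. This is often the best option for two loosely related, mostly independent values.
To demonstrate: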
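The idea of combining two single-column indexes can be sketched in Python with sets of row ids. This is a simplified model for illustration only, not how PostgreSQL implements bitmap scans; the names heap, index_a, index_b and find are hypothetical.

```python
from collections import defaultdict

# Toy heap: row id -> (a, b). Smaller than the demo table to keep it quick.
pairs = [(a, b) for a in range(1, 51) for b in range(1, 51)]
heap = dict(enumerate(pairs))

# Two independent single-column "indexes": value -> set of matching row ids.
index_a = defaultdict(set)
index_b = defaultdict(set)
for rid, (a, b) in heap.items():
    index_a[a].add(rid)
    index_b[b].add(rid)

def find(a=None, b=None):
    """Use whichever index applies; intersect the row-id sets when both do."""
    if a is not None and b is not None:
        rids = index_a[a] & index_b[b]   # analogous to ANDing two bitmaps
    elif a is not None:
        rids = index_a[a]
    elif b is not None:
        rids = index_b[b]
    else:
        rids = set(heap)                 # no usable condition: scan everything
    return sorted(heap[rid] for rid in rids)

if __name__ == "__main__":
    print(find(a=42, b=3))
    print(len(find(b=3)))
```

Unlike a single combined key, this arrangement can answer a query on a alone, on b alone, or on both.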
CREATE TABLE demoab(a integer, b integer);
INSERT INTO demoab(a, b)
SELECT a, b from generate_series(1,1000) a
CROSS JOIN generate_series(1,1000) b;
CREATE INDEX demoab_a ON demoab(a);
CREATE INDEX demoab_b ON demoab(b);
CREATE INDEX demoab_ab ON demoab(a,b);
CREATE INDEX demoab_ab_shifted ON demoab ((a::bigint << 32 + b));
ANALYZE demoab;
CREATE TABLE demob AS SELECT DISTINCT b FROM demoab ;
CREATE TABLE demoa AS SELECT DISTINCT a FROM demoab ;
ALTER TABLE demoa ADD PRIMARY KEY (a);
ALTER TABLE demob ADD PRIMARY KEY (b);
Different query approaches:
regress=> explain analyze SELECT * FROM demoab WHERE a = 42 AND b = 3;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using demoab_ab on demoab (cost=0.00..8.38 rows=1 width=8) (actual time=0.034..0.036 rows=1 loops=1)
  Index Cond: ((a = 42) AND (b = 3))
Total runtime: 0.088 ms
(3 rows)
regress=> explain analyze SELECT * FROM demoab WHERE b = 3;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on demoab (cost=19.85..2358.66 rows=967 width=8) (actual time=1.089..4.636 rows=1000 loops=1)
  Recheck Cond: (b = 3)
  -> Bitmap Index Scan on demoab_b (cost=0.00..19.61 rows=967 width=0) (actual time=0.661..0.661 rows=1000 loops=1)
        Index Cond: (b = 3)
Total runtime: 4.820 ms
(5 rows)
regress=> explain analyze SELECT * FROM demoab WHERE a = 42;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using demoab_a on demoab (cost=0.00..37.19 rows=962 width=8) (actual time=0.155..0.751 rows=1000 loops=1)
  Index Cond: (a = 42)
Total runtime: 0.929 ms
(3 rows)
regress=> explain analyze SELECT * FROM demoab WHERE (a::bigint << 32 + b) = (42::bigint << 32 + 3);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on demoab (cost=4.69..157.67 rows=41 width=8) (actual time=0.260..0.495 rows=94 loops=1)
  Recheck Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint)
  -> Bitmap Index Scan on demoab_ab_shifted (cost=0.00..4.67 rows=41 width=0) (actual time=0.232..0.232 rows=94 loops=1)
        Index Cond: (((a)::bigint << (32 + b)) = 1443109011456::bigint)
Total runtime: 0.584 ms
(5 rows)
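Here the composite index on (a,b) would be a clear win if it were allowed to fetch the tuples directly with an index-only scan, but that would not reflect real-world use, where you would probably be fetching values from non-indexed columns too. Hence I ran SET enable_indexonlyscan = off for these tests.
Rather surprisingly, the index sizes are the same: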
regress=> SELECT
pg_relation_size('demoab_ab_shifted') AS shifted,
pg_relation_size('demoab_ab') AS ab,
pg_relation_size('demoab_a') AS a,
pg_relation_size('demoab_b') AS b;
shifted | ab | a | b
----------+----------+----------+----------
22487040 | 22487040 | 22487040 | 22487040
(1 row)
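I expected the single-value indexes to need considerably less space. Alignment requirements explain part of it, but it is still an unexpected result to me.
For the join case that @wildplasser asks about: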
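The alignment effect can be estimated with a back-of-the-envelope model in Python. It assumes an 8-byte index tuple header and 8-byte maximum alignment, as on a typical 64-bit build; both constants and the function names are assumptions made for this sketch, and it ignores page headers, line pointers and fill factor.

```python
INDEX_TUPLE_HEADER = 8   # assumed: per-entry header size in bytes
MAXALIGN = 8             # assumed: alignment boundary on a typical 64-bit build
TYPE_SIZE = {"integer": 4, "bigint": 8}

def align(n: int, boundary: int = MAXALIGN) -> int:
    """Round n up to the next multiple of boundary."""
    return (n + boundary - 1) // boundary * boundary

def index_entry_size(column_types) -> int:
    """Approximate size of one index entry.

    Ignores per-column padding, which is fine when all columns share one type.
    """
    data = sum(TYPE_SIZE[t] for t in column_types)
    return align(INDEX_TUPLE_HEADER + data)

if __name__ == "__main__":
    for name, cols in [("a", ["integer"]),
                       ("ab", ["integer", "integer"]),
                       ("shifted", ["bigint"])]:
        print(name, index_entry_size(cols))
```

Under these assumptions a lone integer is padded out to the same entry size as two integers or one bigint, which is consistent with the four indexes reporting the same size.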
regress=> EXPLAIN ANALYZE
SELECT demoa.a, demob.b
FROM demoab
INNER JOIN demoa ON (demoa.a = demoab.a)
INNER JOIN demob ON (demob.b = demoab.b)
WHERE demoa.a = 100 AND demob.b = 500;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..24.94 rows=1 width=8) (actual time=0.121..0.126 rows=1 loops=1)
  -> Nested Loop (cost=0.00..16.66 rows=1 width=8) (actual time=0.089..0.092 rows=1 loops=1)
        -> Index Scan using demoab_ab on demoab (cost=0.00..8.38 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=1)
              Index Cond: ((a = 100) AND (b = 500))
        -> Index Scan using demoa_pkey on demoa (cost=0.00..8.27 rows=1 width=4) (actual time=0.062..0.062 rows=1 loops=1)
              Index Cond: (a = 100)
  -> Index Scan using demob_pkey on demob (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.031 rows=1 loops=1)
        Index Cond: (b = 500)
Total runtime: 0.203 ms
(9 rows)
This shows that in this case PostgreSQL prefers the composite index on (a,b). That is not so if you join on b alone:
regress=> EXPLAIN ANALYZE
SELECT demoab.a, demoab.b
FROM demoab
INNER JOIN demob ON (demob.b = demoab.b)
WHERE demob.b = 500;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=19.85..2376.59 rows=967 width=8) (actual time=0.935..3.653 rows=1000 loops=1)
  -> Index Scan using demob_pkey on demob (cost=0.00..8.27 rows=1 width=4) (actual time=0.029..0.032 rows=1 loops=1)
        Index Cond: (b = 500)
  -> Bitmap Heap Scan on demoab (cost=19.85..2358.66 rows=967 width=8) (actual time=0.897..3.123 rows=1000 loops=1)
        Recheck Cond: (b = 500)
        -> Bitmap Index Scan on demoab_b (cost=0.00..19.61 rows=967 width=0) (actual time=0.436..0.436 rows=1000 loops=1)
              Index Cond: (b = 500)
Total runtime: 3.834 ms
(8 rows)
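Note that no functional hash index would be of any use here. So I would recommend a composite index on (a,b), plus a secondary index on (b) if you need one.
As for uniqueness, you will find pg_catalog.pg_stats informative. There you will see that PostgreSQL does not maintain separate statistics for an ordinary index, only for the heap columns it covers; an expression index such as demoab_ab_shifted is the exception and gets its own row. In this case: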
regress=> select tablename, attname, n_distinct, correlation
from pg_stats where tablename like 'demo%';
tablename | attname | n_distinct | correlation
-------------------+---------+------------+-------------
demoab | a | 1000 | 1
demoab | b | 1000 | 0.0105023
demoab_ab_shifted | expr | 21593 | 0.0175595
demob | b | -1 | 0.021045
demoa | a | -1 | 0.021045
(5 rows)
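It looks like Pg won't see any significant difference between the hashed/combined approach and two discrete, independent values.
Answer 1 (score: 1)
If Col1 and Col2 are independent fields, why combine them? It will not save any space. Stick to the database principle of atomicity.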
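For reading the n_distinct column: a positive value is an estimated count of distinct values, while a negative value is minus the estimated fraction of rows that are distinct, so -1 means every row is unique. The Python sketch below applies that rule; the function names are hypothetical, and the rows-per-value figure is a crude uniform-distribution guess, not the planner's actual algorithm.

```python
def distinct_values(n_distinct: float, row_count: float) -> float:
    """Turn a pg_stats n_distinct value into an absolute count of distinct values."""
    if n_distinct < 0:
        # Negative: minus the fraction of rows that are distinct.
        return -n_distinct * row_count
    return n_distinct

def rough_rows_per_value(n_distinct: float, row_count: float) -> float:
    """Crude guess at rows matched by an equality test, assuming a uniform spread."""
    return row_count / distinct_values(n_distinct, row_count)

if __name__ == "__main__":
    print(distinct_values(-1, 1000))                # demoa.a: every row distinct
    print(rough_rows_per_value(1000, 1_000_000))    # demoab.a
    print(rough_rows_per_value(21593, 1_000_000))   # demoab_ab_shifted expr
```

These rough figures land in the same range as the row estimates in the plans above (967, 962 and 41); the planner's real estimates also draw on other statistics, so they do not match exactly.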