We have two PostgreSQL tables, A and B:
CREATE TABLE A
(
    aid text PRIMARY KEY NOT NULL,
    sessionid text,
    b_id text,
    ...
);
CREATE TABLE B
(
    bid text PRIMARY KEY NOT NULL,
    sessionids text[] NOT NULL DEFAULT '{}'::text[],
    ...
);
The following index exists:
CREATE INDEX CONCURRENTLY index_A_from_B ON A (b_id, sessionid);
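There is a one-to-many relationship between table B and table A.
Table A is huge (millions of records), while table B can be assumed to be much smaller (fewer than 10e4 records). The one-to-many relationship works as follows: a record in table A matches exactly one record in table B under the condition b_id = bid OR sessionid = ANY (sessionids).
A trigger function on table A guarantees referential integrity; it is this trigger that actually UPSERTs the rows of table B (no record is inserted into table B directly).
The question is: how can we speed up a query like this one?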
WITH
-- CTE to extract relevant records of table B
relevant_B_part as (
    SELECT bid, sessionids
    FROM B
    WHERE t > date('2017-03-9') -- some conditions, but there is a partial index that speeds up this part
),
-- CTE to extract relevant records of table A
relevant_A_part as (
    SELECT *
    FROM A
    WHERE t > date('2017-02-20')
)
SELECT count(*)
FROM
    relevant_A_part as rA
    INNER JOIN relevant_B_part as rB
    ON
        rA.b_id = rB.bid
        OR rA.sessionid = ANY (rB.sessionids)
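This query basically takes some records of table B and counts the matching records of table A (although there may be some additional conditions on table A).
Is there a clever way (indexes, foreign keys, ...) to explain to PostgreSQL how these two tables are linked, so that our query runs faster?
In principle we do not need to perform a JOIN. We also tried lateral joins, but so far every query we have tried has been very slow.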
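A lateral formulation of the same count can be sketched as follows. This is an illustration only, not necessarily the exact query that was run; it reuses the table and column names from the query above and assumes, as that query implies, that both tables have a column t.

```sql
-- Sketch only: the same count written with LATERAL instead of a plain join.
-- Assumes a column t exists on both A and B, as the query above implies.
SELECT count(*)
FROM B AS rB
CROSS JOIN LATERAL (
    -- one output row per record of A that matches the current record of B
    SELECT 1
    FROM A AS rA
    WHERE rA.t > date('2017-02-20')
      AND (rA.b_id = rB.bid
           OR rA.sessionid = ANY (rB.sessionids))
) AS matched
WHERE rB.t > date('2017-03-9');
```

The matching condition still contains the OR, so this form expresses the same work per record of B as the join does.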
We cannot add columns to table A, but we are free to change table B.
EXPLAIN ANALYZE output:
Aggregate (cost=13264593.88..13264593.89 rows=1 width=8) (actual time=151434.540..151434.540 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=9677 read=9009, temp written=8769
  CTE relevant_b_part
    -> Bitmap Heap Scan on public.b (cost=25.50..897.81 rows=387 width=118) (actual time=0.104..0.744 rows=227 loops=1)
         Output: b.bid, b.sessionids
         Recheck Cond: (b.t > '2017-03-09'::date)
         Heap Blocks: exact=122
         Buffers: shared hit=127
         -> Bitmap Index Scan on idx_b_p (cost=0.00..25.41 rows=387 width=0) (actual time=0.080..0.080 rows=227 loops=1)
              Buffers: shared hit=5
  CTE relevant_a_part
    -> Seq Scan on public.a (cost=0.00..38970.65 rows=971199 width=86) (actual time=0.022..803.094 rows=970142 loops=1)
         Output: a.aid, a.sessionid, a.b_id, a.t
         Filter: (a.t > '2017-02-20'::date)
         Rows Removed by Filter: 662790
         Buffers: shared hit=9550 read=9009
  -> Nested Loop (cost=0.00..13174318.31 rows=20162846 width=0) (actual time=130814.665..151433.062 rows=5530 loops=1)
       Join Filter: ((ra.b_id = rb.bid) OR (ra.sessionid = ANY (rb.sessionids)))
       Rows Removed by Join Filter: 220216704
       Buffers: shared hit=9677 read=9009, temp written=8769
       -> CTE Scan on relevant_a_part ra (cost=0.00..19423.98 rows=971199 width=64) (actual time=0.028..1689.252 rows=970142 loops=1)
            Output: ra.aid, ra.sessionid, ra.b_id, ra.t
            Buffers: shared hit=9550 read=9009, temp written=8769
       -> CTE Scan on relevant_b_part rb (cost=0.00..7.74 rows=387 width=64) (actual time=0.000..0.054 rows=227 loops=970142)
            Output: rb.bid, rb.sessionids
            Buffers: shared hit=127
Planning time: 0.268 ms
Execution time: 151438.180 ms
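
The figures in this plan are consistent with the nested loop testing the join filter against every pairing of the 970142 rows from relevant_a_part with the 227 rows from relevant_b_part. A quick arithmetic check, using only numbers copied from the plan:

```sql
-- All constants are copied from the EXPLAIN ANALYZE output above.
SELECT 970142 * 227     AS pairs_tested,     -- rows of the A part x rows of the B part
       220216704 + 5530 AS pairs_accounted;  -- rows removed by the join filter + rows kept
```

Both expressions evaluate to 220222234, so almost all of the roughly 150 seconds appears to go into evaluating the OR condition about 220 million times.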