Speeding up a PostgreSQL query with a complex JOIN condition

Asked: 2017-03-06 18:45:33

Tags: database postgresql

We have two PostgreSQL tables, A and B.

CREATE TABLE A
(
    aid          text PRIMARY KEY NOT NULL,
    sessionid    text,
    b_id         text,
    ...
);

CREATE TABLE B
(
    bid          text PRIMARY KEY NOT NULL,
    sessionids   text[] NOT NULL DEFAULT '{}'::text[],
    ...
);

The following index exists:

CREATE INDEX CONCURRENTLY index_A_from_B ON A (b_id, sessionid);

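There is a one-to-many relationship between table B and table A.

Table A is huge (millions of records), while table B can be assumed to be much smaller (fewer than 10e4 records). The one-to-many relationship works as follows: a record of table A matches exactly one record of table B under the condition b_id = bid OR sessionid = ANY (sessionids).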
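To make the matching rule concrete, here is a minimal sketch that looks up the B record belonging to a single A record. The key value 'some-aid' is a hypothetical placeholder, not a value from our data.

```sql
-- Find the B record that a given A record belongs to.
-- 'some-aid' is a hypothetical key used only for illustration.
SELECT b.bid
FROM A AS a
JOIN B AS b
  ON a.b_id = b.bid                      -- match through the explicit reference
  OR a.sessionid = ANY (b.sessionids)    -- or through membership in B's session array
WHERE a.aid = 'some-aid';
```

Under the one-to-many relationship described above, this should return at most one row.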

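A trigger function on table A guarantees referential integrity; it is this trigger that actually UPSERTs the records of table B (no record is ever inserted directly into table B).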

Question

The question is: how can we speed up a query like this one?

WITH 
-- CTE to extract relevant records of table B
relevant_B_part as (
    SELECT bid, sessionids
    FROM B
    WHERE t > date('2017-03-9') -- some conditions, but there is a partial index that speeds up this part
),
-- CTE to extract relevant records of table A
relevant_A_part as (
    SELECT *
    FROM A
    WHERE t > date('2017-02-20')
)
SELECT count(*)
FROM 
    relevant_A_part as rA
INNER JOIN relevant_B_part as rB
ON 
        rA.b_id = rB.bid 
    OR  rA.sessionid = ANY (rB.sessionids)

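This query basically takes some records of table B and counts the matching rows of table A (although there may be some additional conditions on table A).

Is there a clever way (an index, a foreign key, ...) to explain to PostgreSQL how these two tables are linked, so that our query runs faster?

In principle we do not need to perform a JOIN. We have also tried a lateral join, but every query we have tried so far is very slow.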
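For reference, a lateral formulation of the same count might look like the sketch below. This is an illustrative shape only, not necessarily the exact query we ran, and it assumes that both tables have the column t used in the filters of the query above.

```sql
-- Count the A rows matching each relevant B row, using a lateral subquery.
-- Assumes both A and B have a column t, as in the CTE version of the query.
SELECT count(*)
FROM B AS rB
CROSS JOIN LATERAL (
    SELECT rA.aid
    FROM A AS rA
    WHERE rA.t > date('2017-02-20')
      AND (   rA.b_id = rB.bid
           OR rA.sessionid = ANY (rB.sessionids))
) AS matched
WHERE rB.t > date('2017-03-9');
```

The join condition is still an OR of two predicates, so this form is not expected to be fundamentally cheaper on its own. It only moves the filter on A inside the per-B subquery.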

We cannot add columns to table A, but we are free to do so in table B.

EXPLAIN ANALYZE output:

Aggregate  (cost=13264593.88..13264593.89 rows=1 width=8) (actual time=151434.540..151434.540 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=9677 read=9009, temp written=8769
  CTE relevant_b_part
    ->  Bitmap Heap Scan on public.b  (cost=25.50..897.81 rows=387 width=118) (actual time=0.104..0.744 rows=227 loops=1)
          Output: b.bid, b.sessionids
          Recheck Cond: (b.t > '2017-03-09'::date)
          Heap Blocks: exact=122
          Buffers: shared hit=127
          ->  Bitmap Index Scan on idx_b_p  (cost=0.00..25.41 rows=387 width=0) (actual time=0.080..0.080 rows=227 loops=1)
                Buffers: shared hit=5
  CTE relevant_a_part
    ->  Seq Scan on public.a  (cost=0.00..38970.65 rows=971199 width=86) (actual time=0.022..803.094 rows=970142 loops=1)
          Output: a.aid, a.sessionid, a.b_id, a.t
          Filter: (a.t > '2017-02-20'::date)
          Rows Removed by Filter: 662790
          Buffers: shared hit=9550 read=9009
  ->  Nested Loop  (cost=0.00..13174318.31 rows=20162846 width=0) (actual time=130814.665..151433.062 rows=5530 loops=1)
        Join Filter: ((ra.b_id = rb.bid) OR (ra.sessionid = ANY (rb.sessionids)))
        Rows Removed by Join Filter: 220216704
        Buffers: shared hit=9677 read=9009, temp written=8769
        ->  CTE Scan on relevant_a_part ra  (cost=0.00..19423.98 rows=971199 width=64) (actual time=0.028..1689.252 rows=970142 loops=1)
              Output: ra.aid, ra.sessionid, ra.b_id, ra.t
              Buffers: shared hit=9550 read=9009, temp written=8769
        ->  CTE Scan on relevant_b_part rb  (cost=0.00..7.74 rows=387 width=64) (actual time=0.000..0.054 rows=227 loops=970142)
              Output: rb.bid, rb.sessionids
              Buffers: shared hit=127
Planning time: 0.268 ms
Execution time: 151438.180 ms

0 answers:

No answers yet