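Optimizing a query that compares a table with millions of rows against itself

Time: 2019-10-07 05:04:20

Tags: postgresql query-optimization query-performance

I could use some help optimizing a query that compares rows within a single table that has millions of entries. Here is the table definition: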

CREATE TABLE IF NOT EXISTS data.row_check (
    id         uuid NOT NULL DEFAULT NULL,
    version    int8 NOT NULL DEFAULT NULL,
    row_hash   int8 NOT NULL DEFAULT NULL,
    table_name text NOT NULL DEFAULT NULL,

CONSTRAINT row_check_pkey
    PRIMARY KEY (id, version)
);

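I'm reworking our push code and have a testbed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of the insert code has changed. The next step is to checksum each row and then compare the rows for differences between versions of my code. Something like this: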

-- Run my test of "version 0" of the push code, the base code I'm refactoring.  
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
            SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log' 
            FROM record_changes_log

            ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
                row_hash   = EXCLUDED.row_hash,
                table_name = EXCLUDED.table_name;

truncate table record_changes_log;

-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.

INSERT INTO row_check (id,version,row_hash,table_name)
            SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log' 
            FROM record_changes_log

            ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
                row_hash   = EXCLUDED.row_hash,
                table_name = EXCLUDED.table_name;

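For each row in record_changes_log, or in any other table I'm checking, I end up with two rows in row_check. For the two runs of record_changes_log, I end up with 8.6 million rows in row_check. They look like this: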

id                                      version row_hash    table_name
e6218751-ab78-4942-9734-f017839703f6    0   -142492569  record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1    0   -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d    0   -323725113  record_changes_log
1935590c-8d22-4baf-85ba-00b563022983    0   -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2    0   -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0    0   1857654119  record_changes_log

I asked yesterday for some help with the comparison query, and that led to this:

 select v0.table_name,
        v0.id,
        v0.row_hash as v0,
        v1.row_hash as v1   

   from row_check v0 
   join row_check v1 on v0.id = v1.id  and
        v0.version = 0 and
        v1.version  = 1 and
        v0.row_hash <> v1.row_hash

It works, but now I'm hoping to optimize it. As an experiment, I clustered the data on version and then built a BRIN index, like this:

drop index if exists row_check_version_btree;
create index row_check_version_btree
          on row_check
        using btree(version);

cluster row_check using row_check_version_btree;    
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.

drop index if exists row_check_version_brin;
create index row_check_version_brin
          on row_check
        using brin(row_hash);

vacuum analyze row_check;       

I ran the query through explain analyze and got this:

Merge Join  (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
  Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
  Inner Unique: true
  Merge Cond: (v0.id = v1.id)
  Join Filter: (v0.row_hash <> v1.row_hash)
  Rows Removed by Join Filter: 4318290
  Buffers: shared hit=8679005 read=42511
  ->  Index Scan using row_check_pkey on ascendco.row_check v0  (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
        Output: v0.id, v0.version, v0.row_hash, v0.table_name
        Index Cond: (v0.version = 0)
        Buffers: shared hit=4360752
  ->  Index Scan using row_check_pkey on ascendco.row_check v1  (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
        Output: v1.id, v1.version, v1.row_hash, v1.table_name
        Index Cond: (v1.version = 1)
        Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms
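
The exact command behind this plan is not shown. The Output and Buffers lines are what EXPLAIN prints when the VERBOSE and BUFFERS options are on, so a command along these lines is a reasonable guess, offered as an assumption rather than as what was actually run. Adding FORMAT JSON gives the JSON form that plan visualizers accept.

```sql
-- Text plan with per-node output columns and buffer counts.
-- Note that ANALYZE really executes the query.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT v0.table_name,
       v0.id,
       v0.row_hash AS v0,
       v1.row_hash AS v1
  FROM row_check v0
  JOIN row_check v1 ON v0.id = v1.id
                   AND v0.version = 0
                   AND v1.version = 1
                   AND v0.row_hash <> v1.row_hash;

-- The same plan as JSON, suitable for pasting into a plan visualizer.
EXPLAIN (ANALYZE, VERBOSE, BUFFERS, FORMAT JSON)
SELECT v0.table_name,
       v0.id,
       v0.row_hash AS v0,
       v1.row_hash AS v1
  FROM row_check v0
  JOIN row_check v1 ON v0.id = v1.id
                   AND v0.version = 0
                   AND v1.version = 1
                   AND v0.row_hash <> v1.row_hash;
```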

...I don't really get the right idea from that... so I ran it again with JSON output and fed the result into this wonderful plan visualizer:

http://tatiyants.com/pev/#/plans

[Image: query plan node map]

The tip there is right: the top node's estimate is wrong. The result is 10 rows, while the estimate is about 4,437,567 rows.

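I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help: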

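- CREATE STATISTICS?
- Rework the query to move the comparison into the WHERE clause?
- Use a better index? I did try a GIN index and a plain B-tree on version, but neither was any better.
- Rework the row_check format to move the two hashes into the same row instead of splitting them across two rows, do the comparison on insert/update, flag mismatches, and add a partial index for the mismatched values.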
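
Two of the ideas in this list can be sketched in SQL. For an inner join, PostgreSQL's planner treats conditions written in ON and in WHERE the same way, so the first statement is mainly a readability change and should not be expected to alter the plan. The second statement is an alternative that reads the table once and compares the two hashes per id with aggregates; whether it beats the merge join on this data is untested. The index in the third statement is my own assumption: its name is hypothetical, and the idea is to allow index-only scans for queries that read just version, id and row_hash.

```sql
-- 1. Same join, with the filters moved into WHERE.
SELECT v0.table_name,
       v0.id,
       v0.row_hash AS v0,
       v1.row_hash AS v1
  FROM row_check v0
  JOIN row_check v1 ON v1.id = v0.id
 WHERE v0.version = 0
   AND v1.version = 1
   AND v0.row_hash <> v1.row_hash;

-- 2. Single pass with conditional aggregates.
--    The primary key (id, version) guarantees at most one row per id and version,
--    so min() just picks that one value. Ids missing from either version
--    produce a NULL comparison and drop out, as they do in the inner join.
SELECT id,
       min(table_name)                          AS table_name,
       min(row_hash) FILTER (WHERE version = 0) AS v0,
       min(row_hash) FILTER (WHERE version = 1) AS v1
  FROM row_check
 WHERE version IN (0, 1)
 GROUP BY id
HAVING min(row_hash) FILTER (WHERE version = 0)
    <> min(row_hash) FILTER (WHERE version = 1);

-- 3. Hypothetical index: version first, then id, with row_hash carried along.
CREATE INDEX IF NOT EXISTS row_check_version_id_hash
    ON row_check (version, id, row_hash);

-- Index-only scans depend on an up-to-date visibility map.
VACUUM ANALYZE row_check;
```

Because the first query also returns table_name, it would still have to visit the table even with that index in place; dropping table_name from the select list (or looking it up only for the mismatching ids) is what would let the index do all the work.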

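Of course, it's a bit funny to even try indexing something with only two values (0 and 1 in the case above). In fact, is there any clever trick for Booleans? I'll always be comparing two versions, so I can express "old" and "new" in whatever way makes life easiest. I understand that Postgres only builds bitmap indexes internally at search/merge (?) time and that it doesn't have a bitmap index type. Is there some kind of INTERSECT that might help? I don't know how Postgres implements the set math operators internally.

Thanks for any suggestions on how to rethink this data or the query to make comparisons involving millions or tens of millions of rows faster.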
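
On the set-operator idea, EXCEPT is probably a closer fit than INTERSECT, since the interesting rows are the ones that differ. The following is a sketch only and has not been benchmarked against the join. As far as I know, PostgreSQL evaluates these operators by hashing or sorting the combined inputs, so it still has to read every row of both versions.

```sql
-- (id, row_hash) pairs from version 0 that have no identical pair in version 1.
-- This returns ids whose hash changed, and also ids that exist only in version 0.
SELECT id, row_hash
  FROM row_check
 WHERE version = 0
EXCEPT
SELECT id, row_hash
  FROM row_check
 WHERE version = 1;
```

Swapping the two halves gives the version 1 side of each mismatch, plus ids that exist only in version 1.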

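1 answer:

Answer 0 (score: 1)

I'm going to add an answer to my own question, but I'm still interested in what anyone else has to say. In the process of writing out the original question, I realized that a redesign might be in order. This hinges on my plan to only ever compare two versions at a time. It's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row: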

DROP TABLE IF EXISTS data.row_compare;
CREATE TABLE IF NOT EXISTS data.row_compare (
    id           uuid NOT NULL DEFAULT NULL,
    hash_1       int8,    -- Want NULL to defer calculating hash comparison until after both hashes are entered.
    hash_2       int8,    -- Ditto
    hashes_match boolean, -- Likewise 
    table_name   text NOT NULL DEFAULT NULL,

CONSTRAINT row_compare_pkey
    PRIMARY KEY (id)
);

The partial index below should, hopefully, be very small, since I shouldn't have many mismatched entries, if any:

CREATE INDEX row_compare_fail ON row_compare (hashes_match)
    WHERE hashes_match = false;
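
A quick way to check that expectation is to look at the index size and at the plan for the lookup query. This is a sketch that assumes row_compare and row_compare_fail resolve through the current search_path. The last statement is a variation of my own with a hypothetical name: since the predicate already pins hashes_match to false, indexing id instead keeps the index just as small while storing a value that actually varies.

```sql
-- How big is the partial index on disk?
SELECT pg_size_pretty(pg_relation_size('row_compare_fail'::regclass)) AS index_size;

-- Does the planner pick it for the mismatch lookup?
EXPLAIN
SELECT * FROM row_compare WHERE hashes_match = false;

-- Variation (hypothetical name): same predicate, but index the id column.
CREATE INDEX IF NOT EXISTS row_compare_fail_id ON row_compare (id)
    WHERE hashes_match = false;
```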

The trigger function below does the column calculation once both hash_1 and hash_2 have been supplied:

-- Run this as a BEFORE INSERT or UPDATE ROW trigger.
CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
  RETURNS trigger AS 

$BODY$
BEGIN

    -- NULL must be tested with IS NULL; "= NULL" never evaluates to true.
    IF  NEW.hash_1 IS NULL OR
        NEW.hash_2 IS NULL THEN
        RETURN NEW; -- Don't do the comparison; at least one hash hasn't been populated yet.

    ELSE -- Do the comparison. The point of this is to avoid constantly thrashing the partial index.
        NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
        RETURN NEW;     -- important!
    END IF;
END;

$BODY$
LANGUAGE plpgsql;
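
The function only takes effect once it is attached to the table, and that statement is not shown here. A minimal sketch follows, assuming a trigger name of row_compare_before_upsert (hypothetical) and following the comment's instruction to run it as a BEFORE INSERT OR UPDATE row trigger. EXECUTE PROCEDURE is used because it is accepted by both older and newer PostgreSQL releases.

```sql
-- Recreate the trigger so the script can be rerun safely.
DROP TRIGGER IF EXISTS row_compare_before_upsert ON data.row_compare;

CREATE TRIGGER row_compare_before_upsert
    BEFORE INSERT OR UPDATE ON data.row_compare
    FOR EACH ROW
    EXECUTE PROCEDURE data.on_upsert_row_compare();
```

With INSERT ... ON CONFLICT DO UPDATE, the BEFORE INSERT trigger fires for the proposed row first, and the BEFORE UPDATE trigger fires again when the statement falls through to the update, which is the point at which both hashes are present.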

This now adds 4.3 million rows instead of 8.6 million:

-- Add the first set of results and build out the row_compare records.
INSERT INTO row_compare (id,hash_1,table_name)
            SELECT id, hashtext(record_changes_log::text),'record_changes_log'
            FROM record_changes_log

            ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
                hash_1   = EXCLUDED.hash_1,
                table_name = EXCLUDED.table_name;

-- I'll truncate the record_changes_log and push my sample data again here.

-- Add the second set of results and update the row compare records.
-- This time, the hash is going into the hash_2 field for comparison
INSERT INTO row_compare (id,hash_2,table_name)
            SELECT id, hashtext(record_changes_log::text),'record_changes_log'
            FROM record_changes_log

            ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
                hash_2   = EXCLUDED.hash_2,
                table_name = EXCLUDED.table_name;

Now the results are easy to find:

select * from row_compare where hashes_match = false;

This changes the query time from around 17 seconds to around 24 milliseconds.
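
One gap worth noting in this design: hashes_match = false only catches rows that were pushed by both runs. A row that appears in just one run keeps a NULL in the other hash column and in hashes_match, so it never shows up in the query above. A sketch of a companion check, assuming the same row_compare table:

```sql
-- Rows pushed by only one of the two code versions.
SELECT id, table_name, hash_1, hash_2
  FROM row_compare
 WHERE hash_1 IS NULL
    OR hash_2 IS NULL;
```

This one is not covered by the partial index, so on millions of rows it will most likely scan the whole table.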