Question

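I have an idempotent background processing task that takes rows of information, does some cleanup, and inserts them into the database. My problem is that the same information may be processed more than once.

To deal with this, I build a key (a hash) from the information I have about each row of data, and I created a unique constraint on the index to prevent duplicates.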
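
As a rough illustration of the kind of key described above, here is a minimal Python sketch. The field names and the `make_key` helper are hypothetical: the post does not say which fields or which hash function the job uses, so MD5 over a canonical JSON encoding is only an assumption.

```python
import hashlib
import json

def make_key(row: dict) -> str:
    """Return a deterministic hex digest for a row (hypothetical helper)."""
    # Sort the fields so that their order does not change the hash.
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

a = make_key({"item_name": "widget", "color": "red"})
b = make_key({"color": "red", "item_name": "widget"})
print(a == b)  # the same fields in a different order give the same key
```

Whatever encoding is chosen, it has to be stable across runs, otherwise the unique index cannot catch reprocessed rows.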

The problem: I check whether the data already exists in the database by running:

SELECT key FROM items WHERE key IN (key,key,key,key).

I found the following query to be a bit faster, but some responses are still slow:

SELECT key FROM items WHERE (key = ANY(VALUES(key),(key)))

I then intersect the returned keys with the keys I expected and process only the data that does not exist yet.
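
The comparison step described above could look roughly like this on the application side. The `keys_to_process` helper and the sample keys are made up for illustration; the post does not show the actual job code.

```python
def keys_to_process(expected, existing):
    """Return the expected keys that the database did not return, in input order."""
    existing_set = set(existing)  # set membership tests are O(1) on average
    return [k for k in expected if k not in existing_set]

expected = ["k1", "k2", "k3", "k4"]
existing = ["k2", "k4"]  # pretend result of the SELECT above
print(keys_to_process(expected, existing))  # ['k1', 'k3']
```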

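This worked well until the table passed 100 million rows. I may check 100 or more keys at a time, which causes a lot of IO because each matching row is scanned and retrieved.

My question: is there a more efficient way to check for existence using the unique constraint and index? Perhaps something that does not actually visit each row?

Or is there a different approach that could work? Would simply attempting the insert and catching the unique constraint violation actually be faster?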
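
For reference, the "just try the insert" idea does not necessarily require catching exceptions. PostgreSQL 9.5 and later support `INSERT ... ON CONFLICT DO NOTHING`, which skips rows that would violate the unique index. The Python sketch below uses the standard-library `sqlite3` module with an in-memory table purely as a stand-in so that it runs anywhere (SQLite 3.24 and later accept the same clause). The reduced `items` schema and the `insert_new` helper are assumptions, and whether this beats the existence check on a table of this size would have to be measured.

```python
import sqlite3

def insert_new(conn, rows):
    """Insert (key, item_name) pairs, silently skipping keys that already exist."""
    before = conn.total_changes
    conn.executemany(
        "INSERT INTO items (key, item_name) VALUES (?, ?) "
        "ON CONFLICT (key) DO NOTHING",
        rows,
    )
    conn.commit()
    return conn.total_changes - before  # number of rows actually added

# In-memory stand-in for the real table (assumed, reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, key TEXT UNIQUE, item_name TEXT)"
)
print(insert_new(conn, [("hash1", "a"), ("hash2", "b")]))  # 2
print(insert_new(conn, [("hash2", "b"), ("hash3", "c")]))  # 1
```

In PostgreSQL, appending `RETURNING key` to such a statement reports which keys were actually inserted, which can replace the separate existence query.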

Simplified table definition:

         Column         |            Type             |                           Modifiers                           | Storage  | Description
------------------------+-----------------------------+---------------------------------------------------------------+----------+-------------
 id                     | integer                     | not null default nextval('items_id_seq'::regclass)            | plain    |
 created_at             | timestamp without time zone | not null                                                      | plain    |
 updated_at             | timestamp without time zone | not null                                                      | plain    |
 key                    | character varying(255)      |                                                               | extended |
 item_attributes        | hstore                      |                                                               | extended |
 item_name              | character varying(255)      |                                                               | plain    |
Indexes:
    "items_pkey" PRIMARY KEY, btree (id)
    "index_items_on_key" UNIQUE, btree (key)

A query plan:

Nested Loop  (cost=0.10..108.25 rows=25 width=41) (actual time=0.315..2.169 rows=25 loops=1)
   ->  HashAggregate  (cost=0.10..0.17 rows=25 width=32) (actual time=0.071..0.097 rows=25 loops=1)
         ->  Values Scan on "*VALUES*"  (cost=0.00..0.09 rows=25 width=32) (actual time=0.009..0.033 rows=25 loops=1)
   ->  Index Scan using index_items_on_key on items  (cost=0.00..4.32 rows=1 width=41) (actual time=0.076..0.077 rows=1 loops=25)
         Index Cond: ((key)::text = "*VALUES*".column1)
 Total runtime: 2.406 ms

Answer 1

You don't say where the data comes from or how it is processed, so here is a generic approach:

with to_be_inserted (id, key) as (
    values (1, 'the_hash'), (2, 'another_hash')
)
insert into items (id, key)
-- replace tbi.id, tbi.key with whatever cleanup expressions the job needs
select tbi.id, tbi.key
from to_be_inserted tbi
where not exists (
    select 1
    from items
    where items.key = tbi.key
);

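Storing the hash as bytea instead of text could improve performance significantly: the raw bytes take half the space of the hex text, so the index shrinks to roughly half its size as well. Also consider the smaller md5 hash.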
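
The size difference is easy to check. The following Python sketch compares the hex text form of an MD5 digest with its raw bytes; the payload is a made-up example.

```python
import hashlib

payload = b"example row contents"  # hypothetical input
hex_key = hashlib.md5(payload).hexdigest()  # what a text column would hold
raw_key = hashlib.md5(payload).digest()     # what a bytea column would hold

print(len(hex_key))  # 32 characters
print(len(raw_key))  # 16 bytes
print(bytes.fromhex(hex_key) == raw_key)  # True: same value, half the size
```

On the PostgreSQL side, `decode(some_hex_text, 'hex')` converts an existing hex string to bytea.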

If the processing can't be done in SQL, this key lookup may be faster:

with might_be_inserted (key) as (
    values ('hash1'), ('hash2')
)
select key 
from might_be_inserted mbi
where not exists (
    select 1
    from items
    where key = mbi.key
)
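
A background job could issue the query above with bound parameters, along the following lines. This is only a sketch: it uses Python's standard-library `sqlite3` module with an in-memory table as a stand-in for PostgreSQL so that it is runnable, and the `missing_keys` helper and reduced schema are assumptions. With a PostgreSQL driver the placeholder style would differ.

```python
import sqlite3

def missing_keys(conn, keys):
    """Return the candidate keys that are not yet present in items."""
    if not keys:
        return []
    placeholders = ", ".join(["(?)"] * len(keys))  # one VALUES row per key
    sql = (
        f"WITH might_be_inserted (key) AS (VALUES {placeholders}) "
        "SELECT mbi.key FROM might_be_inserted mbi "
        "WHERE NOT EXISTS (SELECT 1 FROM items WHERE items.key = mbi.key)"
    )
    return [row[0] for row in conn.execute(sql, keys)]

# In-memory stand-in for the real table (assumed, reduced schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, key TEXT UNIQUE)")
conn.executemany("INSERT INTO items (key) VALUES (?)", [("hash1",)])
print(sorted(missing_keys(conn, ["hash1", "hash2", "hash3"])))  # ['hash2', 'hash3']
```

The query returns only the keys that still need processing, so no comparison step is needed in the application.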
