Efficiently loading a big table that is referenced by other tables

Time: 2016-09-20 05:17:12

Tags: postgresql amazon-rds postgresql-9.5

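My use case is the following:

I have a big table users (about 200 million rows) with user_id as its primary key. Several other tables reference users through foreign keys declared with ON DELETE CASCADE.

Every day I have to replace the entire contents of users from a large number of CSV files. (Please don't ask why I have to do this, I just have to...)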

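My idea was to set the primary key and all the foreign keys to DEFERRED and then, within a single transaction, DELETE the whole table and load all the CSVs with the COPY command. The expected result was that all checks and index computation would happen at the end of the transaction. In practice, the insert is extremely slow (4 hours, versus 10 minutes if I insert first and add the primary key afterwards), and no foreign key can reference a deferrable primary key. Because of the foreign keys, I cannot drop the primary key during the insert. And I don't want to get rid of the foreign keys, because I would then have to emulate the ON DELETE CASCADE behaviour by hand.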
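For anyone reproducing this setup, a quick way to see which of the constraints involved are actually declared deferrable, and which delete action each foreign key carries, is to query the pg_constraint catalog. This is only a diagnostic sketch; it assumes a table literally named users exists in the current search_path.

```sql
-- Lists the constraints defined on users and the foreign keys pointing at it.
-- Assumption: the table is really called "users" (the asker says the name is made up).
SELECT c.conname,
       c.contype,                          -- 'p' = primary key, 'f' = foreign key
       c.conrelid::regclass AS on_table,   -- table the constraint is defined on
       c.condeferrable,                    -- declared DEFERRABLE?
       c.condeferred,                      -- declared INITIALLY DEFERRED?
       c.confdeltype                       -- FK delete action: 'c' = cascade, 'a' = no action
FROM pg_constraint c
WHERE c.conrelid  = 'users'::regclass
   OR c.confrelid = 'users'::regclass;
```

Note that deferring a primary key only postpones the uniqueness check; the index entries are still written row by row during the load, which is consistent with the slow insert described above.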

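So basically I am looking for a way to tell postgres not to care about the primary key index or the foreign key checks until the end of the transaction.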

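PS1: I made up the users table. I am actually working with a very different kind of data, but that is not relevant to the problem.

PS2: As a rough estimate, I would say that every day, out of my 200+ million records, about 10 are removed, 1 million are updated, and 1 million are added.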

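2 answers:

Answer 0: (score: 1)

A full delete plus a full insert causes a flood of cascading FK actions, which would have to be postponed with DEFERRED, and that in turn leaves the DBMS with an avalanche of work at commit time.

Instead, don't {delete + create} the keys; keep them in place. Also, don't touch records that don't need to be touched.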

        -- staging table
CREATE TABLE tmp_users AS SELECT * FROM big_users WHERE 1=0;

COPY tmp_users (...) FROM '...' WITH CSV;
-- ... and more copying ...
-- ... from more files ...

        -- If this fails, you have a problem!
ALTER TABLE tmp_users
        ADD PRIMARY KEY (id);

-- [EDIT]
-- I added this later, because the user_comments table
-- was not present in the original question.
DELETE FROM user_comments c
WHERE NOT EXISTS (
    SELECT * FROM tmp_users u WHERE u.id = c.user_id
    );
        -- These deletes are allowed to cascade
        -- [we assume that the import of the CSV files was complete, here ...]
DELETE FROM big_users b
WHERE NOT EXISTS (
        SELECT *
        FROM tmp_users t
        WHERE t.id = b.id
        );

        -- Only update the records that actually **change**
        -- [ updates are expensive in terms of I/O, because they create new row versions,
        --   and the old row versions have to be cleaned up afterwards ]
        -- Note that the key (id) does not change, so there will be no cascading.
        -- ------------------------------------------------------------
UPDATE big_users b
SET name_1 = t.name_1
        , name_2 = t.name_2
        , address = t.address
        -- , ... ALL THE COLUMNS here, except the key(s)
FROM tmp_users t
WHERE  t.id = b.id
AND (t.name_1, t.name_2, t.address, ...) -- ALL THE COLUMNS, except the key(s)
        IS DISTINCT FROM
        (b.name_1, b.name_2, b.address, ...)
        ;

        -- Maybe there were some new records in the CSV files. Add them.
INSERT INTO big_users (id,name_1,name_2,address, ...)
SELECT id,name_1,name_2,address, ...
FROM tmp_users t
WHERE NOT EXISTS (
        SELECT *
        FROM big_users x
        WHERE x.id = t.id
        );
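Since the question is tagged postgresql-9.5, the UPDATE and INSERT steps above could also be folded into one statement with INSERT ... ON CONFLICT DO UPDATE. This is a different technique from the one the answer uses, shown here only as a sketch. The toy schema and sample rows are assumptions; only the table names and the columns name_1, name_2 and address come from the answer, and the real column list stays elided there.

```sql
-- Toy stand-ins for the answer's tables (schema assumed for illustration).
CREATE TABLE big_users (
    id      bigint PRIMARY KEY,
    name_1  text,
    name_2  text,
    address text
);
CREATE TABLE tmp_users AS SELECT * FROM big_users WHERE false;
ALTER TABLE tmp_users ADD PRIMARY KEY (id);  -- also guarantees no duplicate ids in staging

-- Made-up sample data: id 1 changed, id 2 unchanged, id 3 new.
INSERT INTO big_users VALUES (1, 'a', 'b', 'old street'),
                             (2, 'c', 'd', 'same street');
INSERT INTO tmp_users VALUES (1, 'a', 'b', 'new street'),
                             (2, 'c', 'd', 'same street'),
                             (3, 'e', 'f', 'third street');

-- Upsert: new ids are inserted, existing ids are updated,
-- and the WHERE clause skips rows whose values did not change.
INSERT INTO big_users AS b (id, name_1, name_2, address)
SELECT t.id, t.name_1, t.name_2, t.address
FROM tmp_users t
ON CONFLICT (id) DO UPDATE
SET name_1  = EXCLUDED.name_1,
    name_2  = EXCLUDED.name_2,
    address = EXCLUDED.address
WHERE (b.name_1, b.name_2, b.address)
      IS DISTINCT FROM
      (EXCLUDED.name_1, EXCLUDED.name_2, EXCLUDED.address);
```

With the sample rows, id 1 is rewritten, id 2 is left alone, and id 3 is inserted. The DELETE of rows missing from the staging table is still needed as a separate step, and whether this variant is faster than the separate UPDATE and INSERT on 200 million rows would have to be measured.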

Answer 1: (score: 0)

I found a hacky solution:

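BEGIN;
-- Mark the primary key index as invalid so it is not maintained during the load
UPDATE pg_index SET indisvalid = false, indisready = false WHERE indexrelid = 'users_pkey'::regclass;
DELETE FROM users;
COPY users FROM 'file.csv' WITH CSV;
-- Rebuild the index; this also resets the flags changed above
REINDEX INDEX users_pkey;
-- Manual replacement for ON DELETE CASCADE
DELETE FROM user_comments c WHERE NOT EXISTS (SELECT * FROM users u WHERE u.id = c.user_id);
COMMIT;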

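The magic dirty hack is to disable the primary key index in the postgres catalog and to force a reindex at the end (which overwrites what we changed). I cannot use a foreign key with ON DELETE CASCADE because, for some reason, the constraint is then enforced immediately... so my foreign keys are ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED and I have to perform the deletes myself.

This works well in my case because only a few users are referenced in the other tables.

I wish there were a cleaner solution...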
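A likely explanation for the immediate enforcement, as far as I understand PostgreSQL's behaviour, is that referential actions such as CASCADE are never deferred, even on a DEFERRABLE constraint; only the NO ACTION check can wait until commit. A minimal sketch of the foreign key described above follows. The toy tables, the comment_id column and the constraint name are hypothetical; only users, user_comments, id and user_id come from the answer's code.

```sql
-- Toy schema (assumed for illustration).
CREATE TABLE users (id bigint PRIMARY KEY);
CREATE TABLE user_comments (
    comment_id bigint PRIMARY KEY,             -- hypothetical column
    user_id    bigint NOT NULL,
    CONSTRAINT user_comments_user_id_fkey      -- hypothetical constraint name
        FOREIGN KEY (user_id) REFERENCES users (id)
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED
);
INSERT INTO users VALUES (1), (2);
INSERT INTO user_comments VALUES (10, 1), (20, 2);

BEGIN;
DELETE FROM users;               -- no error yet: the FK check is deferred
INSERT INTO users VALUES (1);    -- stands in for the reload; user 2 is gone
DELETE FROM user_comments c      -- manual replacement for the cascade
WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.id = c.user_id);
COMMIT;                          -- the deferred FK check runs here and passes
```

If the orphan cleanup were left out, the COMMIT would fail with a foreign key violation, so the manual DELETE has to stay inside the same transaction.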