Remove duplicates from a table based on multiple criteria and persist to other tables

Asked: 2013-03-30 10:49:55

Tags: sql database postgresql duplicate-removal postgresql-9.2

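I have a taccounts table with columns like account_id (PK), login_name, password and last_login. Now I have to remove some duplicate entries according to new business logic. Accounts count as duplicates if they have either the same email or the same (login_name & password). The account with the latest login must be preserved.

Here are my attempts (some email values are null or blank):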

DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 
GROUP BY lower(trim(both ' ' from email)))

Similarly for login_name and password:

DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)

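Is there a better way, or any way to combine these two separate queries?

Also, some other tables have account_id as a foreign key. How do I update those tables for this change? I am using PostgreSQL 9.2.1.

Edit: Some of the email values are null and some of them are blank (''). So if two accounts have a different login_name & password and their emails are null or blank, they must be considered two different accounts.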
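The null/blank rule can be made explicit by normalizing the email before comparing it. The following is only a sketch with made-up rows (the VALUES list and the alias email_norm are assumptions for illustration): it maps NULL, '' and whitespace-only values to NULL and lower-cases the rest.

```sql
-- Sketch: normalize email so that NULL, '' and whitespace-only values
-- all become NULL, and the remaining values compare case-insensitively.
SELECT account_id
     , NULLIF(lower(trim(email)), '') AS email_norm
FROM  (VALUES
         (1, 'Joe@Example.tld ')
       , (2, 'joe@example.tld')
       , (3, '')
       , (4, '   ')
       , (5, NULL)
      ) AS v(account_id, email);
-- Rows 1 and 2 get the same email_norm; rows 3, 4 and 5 get NULL.
```

Since NULL = NULL is not true, an equality comparison on email_norm never matches two accounts without an email. GROUP BY and DISTINCT ON, however, do put all NULL values into one group, so those rows need to be filtered out or handled separately there.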

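2 Answers:

Answer 0: (score: 1)

Luckily you are running PostgreSQL. DISTINCT ON should make this comparatively easy:

Since you are going to delete most of the rows (~90% dupes) and the table most likely fits into RAM easily, I went this route: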

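  1. SELECT the surviving rows into a temporary table.
  2. Reroute references to the surviving rows.
  3. DELETE all rows from the base table.
  4. Re-INSERT the survivors.

Distill the remaining rows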

    CREATE TEMP TABLE tmp AS
    SELECT DISTINCT ON (login_name, password) *
    FROM  (
       SELECT DISTINCT ON (email) *
       FROM   taccounts
       ORDER  BY email, last_login DESC
       ) sub
    ORDER  BY login_name, password, last_login DESC;
    

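DISTINCT ON is a PostgreSQL extension of SELECT that keeps only the first row of each group, as determined by ORDER BY.

To remove duplicates on two different criteria, I just use a subquery and apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".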
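To see what DISTINCT ON does on its own, here is a minimal sketch with made-up rows (the VALUES list and its contents are assumptions for illustration, not data from the question):

```sql
-- Sketch: keep only the most recently used account per email.
SELECT DISTINCT ON (email) account_id, email, last_login
FROM  (VALUES
         (1, 'a@x.tld', timestamp '2013-03-01')
       , (2, 'a@x.tld', timestamp '2013-03-02')
       , (3, 'b@x.tld', timestamp '2013-03-01')
      ) AS v(account_id, email, last_login)
ORDER  BY email, last_login DESC;
-- Keeps account_id 2 (latest login for a@x.tld) and account_id 3.
```

The leading ORDER BY expressions have to match the DISTINCT ON expressions; the additional last_login DESC decides which row of each group comes first and therefore survives. Note that DISTINCT ON puts all rows with a NULL email into one group, so they would collapse into a single row.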

Inspect the result and test it for plausibility.

    SELECT * FROM tmp;
    

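Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using), the session lives as long as the editor window in which you created the temporary table stays open.

Replacement query for the updated definition of "duplicates"
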
    SELECT *
    FROM   taccounts t
    WHERE  NOT EXISTS (
       SELECT 1
       FROM   taccounts t1
       WHERE (
               NULLIF(t1.email, '') = t.email OR 
               (NULLIF(t1.login_name, ''), NULLIF(t1.password, ''))
             = (t.login_name, t.password)
             )
       AND   (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
       );
    

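This does not treat NULL or an empty string ('') as identical in any of the "duplicate" columns.

The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes share the same last_login. In that case I pick the one with the bigger account_id, which is unique since it is the PK.

How to identify all incoming FKs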
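Row values are compared element by element from left to right, which is what makes account_id work as a tie-breaker. A small sketch with literal values (chosen only for illustration):

```sql
-- Sketch: row-wise comparison. The second element only matters
-- when the first elements are equal.
SELECT (timestamp '2013-03-30 14:00', 2) > (timestamp '2013-03-30 14:00', 1) AS tie_broken_by_id   -- true
     , (timestamp '2013-03-30 14:00', 2) > (timestamp '2013-03-30 15:00', 1) AS older_login_loses; -- false
```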

    SELECT c.conrelid::regclass::text  AS referencing_table
          ,c.confrelid::regclass::text AS referenced_table
          ,c.conname AS fk_name
          ,pg_get_constraintdef(c.oid) AS fk_definition
    FROM   pg_attribute a 
    JOIN   pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
    WHERE  c.confrelid = 'taccounts'::regclass   -- (schema-qualified) table name
    AND    c.contype  = 'f'
    ORDER  BY 1, contype DESC;
    

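This only considers the first column of each foreign key.

Alternatively, select taccounts in the pgAdmin object browser and inspect the Dependents tab in the right-hand window.

Reroute to the new master

If you have tables referencing taccounts (incoming foreign keys to taccounts), you will want to update all those fields before you delete the dupes.
Reroute all of them to the new master row: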

    -- referencing_tbl, referencing_column and reference_column are placeholders;
    -- reference_column is the referenced key, account_id in this case.
    UPDATE referencing_tbl r
    SET    referencing_column = tmp.reference_column
    FROM   tmp
    JOIN   taccounts t1 USING (email)
    WHERE  r.referencing_column = t1.reference_column
    AND    r.referencing_column IS DISTINCT FROM tmp.reference_column;
    
    UPDATE referencing_tbl r
    SET    referencing_column = tmp.reference_column
    FROM   tmp
    JOIN   taccounts t2 USING (login_name, password)
    WHERE  r.referencing_column = t2.reference_column
    AND    r.referencing_column IS DISTINCT FROM tmp.reference_column;
    

Go in for the kill

Now nothing links to the dupes any more. Go in for the kill.

    ALTER TABLE taccounts DISABLE TRIGGER ALL;
    DELETE FROM taccounts;
    VACUUM taccounts;
    INSERT INTO taccounts
    SELECT * FROM tmp;
    ALTER TABLE taccounts ENABLE TRIGGER ALL;
    

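I disable all triggers for the duration of the operation. This avoids checking referential integrity during the operation. Everything should be fine once you re-enable the triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent access and all values have been there before.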
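Re-enabling the triggers does not re-validate rows that were changed while they were disabled, so it can be worth checking for orphaned references afterwards. A minimal sketch, reusing the placeholder names referencing_tbl and referencing_column from the rerouting step (they stand for each real referencing table and column):

```sql
-- Sketch: list references that no longer point to an existing account.
-- referencing_tbl / referencing_column are placeholders, not real names.
SELECT r.referencing_column
FROM   referencing_tbl r
LEFT   JOIN taccounts t ON t.account_id = r.referencing_column
WHERE  r.referencing_column IS NOT NULL
AND    t.account_id IS NULL;
```

If the rerouting worked, this should return no rows for any of the referencing tables.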

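Answer 1: (score: 1)

In addition to Edwin's excellent answer, it is often useful to create an intermediate link table that relates the old keys to the new keys.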

DROP SCHEMA IF EXISTS tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE taccounts
        ( account_id SERIAL PRIMARY KEY
        , login_name varchar
        , email varchar
        , last_login TIMESTAMP
        );
    -- create some fake data
INSERT INTO taccounts(last_login)
SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs
        ;
UPDATE taccounts
SET login_name = 'User_' || (account_id %10)::text
        , email = 'Joe' || (account_id %9)::text || '@somedomain.tld'
        ;

SELECT * FROM taccounts;

        --
        -- Create (temp) table linking old id <--> new id
        -- After inspection this table can be used as a source for the FK updates
        -- and for the final delete.
        --
CREATE TABLE update_ids AS
WITH pairs AS (
        SELECT one.account_id AS old_id
        , two.account_id AS new_id
        FROM taccounts one
        JOIN taccounts two ON two.last_login > one.last_login
                AND ( two.email = one.email OR two.login_name = one.login_name)
        )
SELECT old_id,new_id
FROM pairs pp
WHERE NOT EXISTS (
        SELECT * FROM pairs nx
        WHERE nx.old_id = pp.old_id
        AND nx.new_id > pp.new_id
        )
        ;

SELECT * FROM update_ids
        ;

UPDATE other_table_with_fk_to_taccounts dst
SET account_id = ids.new_id
FROM update_ids ids
WHERE dst.account_id = ids.old_id
        ;
DELETE FROM taccounts del
WHERE EXISTS (
        SELECT * FROM update_ids ex
        WHERE ex.old_id = del.account_id
        );

SELECT * FROM taccounts;
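Because rows are matched on email OR login_name, the link table can contain chains: an account chosen as new_id for one row may itself appear as old_id in another row. The following sketch (the alias next_id is only illustrative) lists such cases, so they can be inspected before the table is used for the FK updates and the delete:

```sql
-- Sketch: find mappings whose target is itself scheduled for deletion.
SELECT a.old_id
     , a.new_id
     , b.new_id AS next_id
FROM   update_ids a
JOIN   update_ids b ON b.old_id = a.new_id;
```

If this returns rows, some new_id values point at accounts that the DELETE would remove, and the mapping would need to be resolved further before it is applied.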

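Another way to accomplish the same thing is to add a column to the table itself that points to the preferred key, and to use that for the updates and deletes.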

ALTER TABLE taccounts
        ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id)
        ;

   -- find the *better* records for each record.
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.last_login > dst.last_login
AND src.email IS NOT NULL
AND NOT EXISTS (
        SELECT * FROM taccounts nx
        WHERE nx.login_name = dst.login_name
        AND nx.email IS NOT NULL
        AND nx.last_login > src.last_login
        );

    -- Find records that *do* have an email address
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.email IS NOT NULL
AND dst.email IS NULL
AND NOT EXISTS (
        SELECT * FROM taccounts nx
        WHERE nx.login_name = dst.login_name
        AND nx.email IS NOT NULL
        AND nx.last_login > src.last_login
        );

SELECT * FROM taccounts ORDER BY account_id;

UPDATE other_table_with_fk_to_taccounts dst
SET account_id = src.better_id
FROM taccounts src      -- better_id lives in taccounts, not in update_ids
WHERE dst.account_id = src.account_id
AND src.better_id IS NOT NULL
        ;

DELETE FROM taccounts del
WHERE EXISTS (
        SELECT * FROM taccounts ex
        WHERE ex.account_id = del.better_id
        );
SELECT * FROM taccounts ORDER BY account_id;