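I have a table taccounts with columns such as account_id (PK), login_name, password and last_login. I now have to remove some duplicate entries according to new business logic.
Duplicate accounts are those that share either the same email or the same (login_name & password). The account with the latest login must be preserved.
Here are my attempts (some email values are null or blank):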
DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0
GROUP BY lower(trim(both ' ' from email)))
The same goes for login_name and password:
DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)
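Is there a better way, or any way to combine these two separate queries?
Also, some other tables have account_id as a foreign key. How do I update those tables for this change?
I am using PostgreSQL 9.2.1.
EDIT: Some email values are null and some of them are blank (''). So if two accounts have different login_name & password and their emails are null or blank, they must be treated as two different accounts.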
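Answer 0: (score: 1)
Luckily, you are running PostgreSQL. DISTINCT ON should make this comparatively easy.
Since you are going to delete most of the rows (~90% dupes) and the table most likely fits into RAM easily, I went this route:
1. SELECT the surviving rows into a temporary table.
2. DELETE all rows from the base table.
3. INSERT the survivors.
CREATE TEMP TABLE tmp AS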
SELECT DISTINCT ON (login_name, password) *
FROM (
SELECT DISTINCT ON (email) *
FROM taccounts
ORDER BY email, last_login DESC
) sub
ORDER BY login_name, password, last_login DESC;
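About DISTINCT ON: it keeps only the first row of each group of rows that agree on the listed expressions, and ORDER BY decides which row counts as "first".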
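To illustrate that behavior in isolation, here is a minimal sketch on three made-up rows (the values are hypothetical and not taken from the question). Because of the ORDER BY, the row with the latest last_login should be the one that survives per email.

```sql
-- Hypothetical sample rows, only to show which row DISTINCT ON keeps.
SELECT DISTINCT ON (email) *
FROM  (
   VALUES (1, 'a@x.tld', timestamp '2013-03-01 10:00')
        , (2, 'a@x.tld', timestamp '2013-03-05 10:00')
        , (3, 'b@x.tld', timestamp '2013-03-02 10:00')
   ) AS t(account_id, email, last_login)
ORDER  BY email, last_login DESC;
-- keeps account_id 2 (latest login for a@x.tld) and account_id 3
```

The leading ORDER BY expressions have to match the DISTINCT ON expressions; the additional sort key (last_login DESC) picks the survivor within each group.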
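To remove duplicates for two different criteria, I simply use a subquery and apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".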
Inspect the results and test them for plausibility.
SELECT * FROM tmp;
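Beyond eyeballing the rows, plausibility could also be checked with a couple of aggregate queries. This is only a sketch: it assumes the same normalization of email that the question uses (trimmed, lower-cased) and expects both queries to return no rows if the deduplication did what was intended.

```sql
-- Emails that still occur more than once (NULL and blank values are ignored).
SELECT lower(trim(email)) AS email_key, count(*) AS n
FROM   tmp
WHERE  email IS NOT NULL
AND    trim(email) <> ''
GROUP  BY 1
HAVING count(*) > 1;

-- (login_name, password) pairs that still occur more than once.
SELECT login_name, password, count(*) AS n
FROM   tmp
GROUP  BY login_name, password
HAVING count(*) > 1;
```

If the first query returns rows, the emails differ only in case or surrounding whitespace, which the plain DISTINCT ON (email) does not fold together.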
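Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using), the session lives as long as the editor window in which you created the temp table stays open.
For the updated definition of duplicates in the question, where NULL or empty values never count as a match, the surviving rows can be selected like this instead: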
SELECT *
FROM taccounts t
WHERE NOT EXISTS (
SELECT 1
FROM taccounts t1
WHERE (
NULLIF(t1.email, '') = t.email OR
(NULLIF(t1.login_name, ''), NULLIF(t1.password, ''))
= (t.login_name, t.password)
)
AND (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
);
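This does not treat NULL or the empty string ('') as identical in any of the "duplicate" columns.
The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes share the same last_login. In that case I pick the one with the bigger account_id, which is unique since it is the PK.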
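The two building blocks of that query can be tried on their own. The following sketch uses made-up literals (not data from the question) to show how the row comparison breaks ties and why an empty string never matches.

```sql
-- Row values compare left to right, so account_id only matters on a tie.
SELECT (timestamp '2013-03-30 14:00', 7) > (timestamp '2013-03-30 14:00', 5) AS tie_broken_by_id  -- true
     , (timestamp '2013-03-30 15:00', 1) > (timestamp '2013-03-30 14:00', 9) AS later_login_wins  -- true
     , NULLIF(''::text, '') = ''                                             AS empty_matches_empty; -- NULL
```

Since the last expression evaluates to NULL rather than true, two rows with empty values are not considered duplicates of each other by the NOT EXISTS query.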
SELECT c.conrelid::regclass::text AS referencing_table
,a.attname AS referencing_column -- first column of the FK only
,c.conname AS fk_name
,pg_get_constraintdef(c.oid) AS fk_definition
FROM pg_attribute a
JOIN pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE c.confrelid = 'taccounts'::regclass -- (schema-qualified) table name
AND c.contype = 'f'
ORDER BY 1, 2;
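This only builds on the first column of each foreign key.
Alternatively, you can inspect the Dependents rider (tab) in the right-hand window of the pgAdmin object browser after selecting the table taccounts.
If you have tables referencing taccounts (incoming foreign keys to taccounts), you will want to update all of those fields before you delete the dupes.
Reroute all of them to the new master row: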
-- reference_column stands for the referenced key of taccounts (account_id in the question)
UPDATE referencing_tbl r
SET referencing_column = tmp.reference_column
FROM tmp
JOIN taccounts t1 USING (email)
WHERE r.referencing_column = t1.reference_column
AND r.referencing_column IS DISTINCT FROM tmp.reference_column;
UPDATE referencing_tbl r
SET referencing_column = tmp.reference_column
FROM tmp
JOIN taccounts t2 USING (login_name, password)
WHERE r.referencing_column = t2.reference_column
AND r.referencing_column IS DISTINCT FROM tmp.reference_column;
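Because the following step runs with triggers disabled, it may be worth confirming first that no referencing row still points at an account that is about to disappear. A possible check, assuming referencing_tbl and referencing_column are the placeholders used above and account_id is the referenced key:

```sql
-- Referencing rows whose target is not among the survivors in tmp.
-- referencing_tbl / referencing_column are placeholders, as in the UPDATEs above.
SELECT r.referencing_column, count(*) AS dangling_rows
FROM   referencing_tbl r
LEFT   JOIN tmp ON tmp.account_id = r.referencing_column
WHERE  r.referencing_column IS NOT NULL
AND    tmp.account_id IS NULL
GROUP  BY r.referencing_column;
```

An empty result suggests every reference has been rerouted; any row returned identifies an account_id that would be left dangling after the delete.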
Now the dupes have no more links pointing to them. Go in for the kill.
ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;
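I disable all triggers for the duration of the operation. This avoids checking referential integrity while it runs. Everything should be fine once you re-enable the triggers: we took care of all incoming FKs above, and outgoing FKs are guaranteed to be sound, since there is no concurrent access and all values were already present before.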
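Answer 1: (score: 1)
In addition to Edwin's excellent answer, it is often useful to create an intermediate link table that relates the old keys to the new ones.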
DROP SCHEMA IF EXISTS tmp CASCADE; -- IF EXISTS avoids an error on the first run
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE taccounts
( account_id SERIAL PRIMARY KEY
, login_name varchar
, email varchar
, last_login TIMESTAMP
);
-- create some fake data
INSERT INTO taccounts(last_login)
SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs
;
UPDATE taccounts
SET login_name = 'User_' || (account_id %10)::text
, email = 'Joe' || (account_id %9)::text || '@somedomain.tld'
;
SELECT * FROM taccounts;
--
-- Create (temp) table linking old id <--> new id
-- After inspection this table can be used as a source for the FK updates
-- and for the final delete.
--
CREATE TABLE update_ids AS
WITH pairs AS (
SELECT one.account_id AS old_id
, two.account_id AS new_id
FROM taccounts one
JOIN taccounts two ON two.last_login > one.last_login
AND ( two.email = one.email OR two.login_name = one.login_name)
)
SELECT old_id,new_id
FROM pairs pp
WHERE NOT EXISTS (
SELECT * FROM pairs nx
WHERE nx.old_id = pp.old_id
AND nx.new_id > pp.new_id
)
;
SELECT * FROM update_ids
;
UPDATE other_table_with_fk_to_taccounts dst
SET account_id = ids.new_id
FROM update_ids ids
WHERE dst.account_id = ids.old_id
;
DELETE FROM taccounts del
WHERE EXISTS (
SELECT * FROM update_ids ex
WHERE ex.old_id = del.account_id
);
SELECT * FROM taccounts;
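Another way to accomplish the same thing is to add a column holding a pointer to the preferred key to the table itself, and to use that for the updates and deletes.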
ALTER TABLE taccounts
ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id)
;
-- find the *better* records for each record.
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.last_login > dst.last_login
AND src.email IS NOT NULL
AND NOT EXISTS (
SELECT * FROM taccounts nx
WHERE nx.login_name = dst.login_name
AND nx.email IS NOT NULL
AND nx.last_login > src.last_login
);
-- Find records that *do* have an email address
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.email IS NOT NULL
AND dst.email IS NULL
AND NOT EXISTS (
SELECT * FROM taccounts nx
WHERE nx.login_name = dst.login_name
AND nx.email IS NOT NULL
AND nx.last_login > src.last_login
);
SELECT * FROM taccounts ORDER BY account_id;
UPDATE other_table_with_fk_to_taccounts dst
SET account_id = src.better_id
FROM taccounts src -- better_id lives in taccounts, not in update_ids
WHERE dst.account_id = src.account_id
AND src.better_id IS NOT NULL
;
DELETE FROM taccounts del
WHERE EXISTS (
SELECT * FROM taccounts ex
WHERE ex.account_id = del.better_id
);
SELECT * FROM taccounts ORDER BY account_id;