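I need to implement a query (or perhaps a stored procedure) that performs soft deduplication of the data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update the other.
The similarity is based on a score. The score is calculated as follows: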
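After all the required pairs of values have been checked:
the id is appended to the {{1}} column of the newer record. How can I solve this task in SQL?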
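The table in question is named duplicate_ids. People records are entered by different administrators. The deduplication process exists to make sure that no two identical people exist in the system.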
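The motivation for the task is simple: performance.
Right now the solution is implemented in a scripting language, using several substandard SQL queries with logic on top of them. However, the data volume is expected to grow to tens of millions of records, and the script will eventually become very slow (it is supposed to run every night via cron).
I am using PostgreSQL.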
Answer 0 (score: 1)
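Deduplication is generally a hard problem.
I found this: https://github.com/dedupeio/dedupe. There is a good description of how it works here: https://dedupe.io/documentation/how-it-works.html.
I would explore dedupe. I would not try to implement it in SQL.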
Answer 1 (score: 0)
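If I understand you correctly, this may help.
You can use PostgreSQL window functions to find all the duplicates and use "weights" to decide which records count as duplicates, so that you can then do whatever you like with them.
Here is an example: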
-- Temporary table for the test; the primary key is id and
-- we have columns A, B, C plus a creation date:
CREATE TEMP TABLE test
(id serial PRIMARY KEY, "colA" text, "colB" text, "colC" text, creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
--  id | colA | colB | colC | creation_date
-- ----+------+------+------+---------------
--   1 | A    | B    | C    | 2017-05-01
--   2 | D    | E    | F    | 2017-06-01
--   3 | A    | B    | D    | 2017-08-01   <-- Duplicate A,B
--   4 | A    | B    | R    | 2017-09-01   <-- Duplicate A,B
--   5 | C    | J    | K    | 2017-09-01
--   6 | A    | C    | J    | 2017-10-01
--   7 | C    | W    | K    | 2017-10-01   <-- Duplicate C,K
--   8 | R    | T    | Y    | 2017-11-01
-- Here is the query you can use to get the ids of the duplicate records
-- (the comments read from the innermost subquery outwards):
-- Third, select the ids of the duplicates.
SELECT id
FROM
(
    -- Second, select the columns needed and weight the duplicates.
    -- You don't need to select every column; if only the id is needed,
    -- you can select just the id.
    -- Run this subquery on its own to see the weights:
    SELECT
        id, "colA", "colB", "colC", creation_date,
        -- The weights are simple: a row number above 1 means the value already
        -- appeared in an earlier row, so assign 1, otherwise assign 0.
        -- The sum is the total "duplicity" weight of the row.
        CASE WHEN "num_colA" > 1 THEN 1 ELSE 0 END +
        CASE WHEN "num_colB" > 1 THEN 1 ELSE 0 END +
        CASE WHEN "num_colC" > 1 THEN 1 ELSE 0 END AS weight
    FROM
    (
        -- First, use window functions to number the rows that share a value.
        -- You can run this query separately to see the results.
        SELECT *,
            -- NOTE: rows are numbered by id; if needed you can order by creation_date instead
            row_number() OVER (PARTITION BY "colA" ORDER BY id) AS "num_colA",
            row_number() OVER (PARTITION BY "colB" ORDER BY id) AS "num_colB",
            row_number() OVER (PARTITION BY "colC" ORDER BY id) AS "num_colC"
        FROM test ORDER BY id
    ) count_column_duplicates
) duplicates
-- THIS IS WHERE THE WEIGHT CUTOFF IS DEFINED; for the test,
-- I selected the rows whose weight is more than 1.
WHERE weight > 1
-- The whole query returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can put this query into a stored procedure so that you can run it whenever you need to. Hope it helps.
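For the pairwise scoring and "squash" step that the question describes, a self-join may be closer to the goal than per-column row numbers. The following is only a minimal sketch under stated assumptions: the table `people_demo`, its columns (`first_name`, `last_name`, `email`, `created_at`, `is_active`, `squashed_ids`), the weights 1/2/3, the function name `squash_duplicates_demo`, and the threshold are all hypothetical and do not come from the question.

```sql
-- Demo table. All names, weights and the threshold are illustrative assumptions.
CREATE TABLE people_demo (
    id           serial PRIMARY KEY,
    first_name   text,
    last_name    text,
    email        text,
    created_at   date NOT NULL,
    is_active    boolean NOT NULL DEFAULT true,
    squashed_ids integer[] NOT NULL DEFAULT '{}'
);

INSERT INTO people_demo (first_name, last_name, email, created_at) VALUES
    ('Ann', 'Lee', 'ann@example.com', '2017-05-01'),  -- id 1
    ('Ann', 'Lee', 'ann@example.com', '2017-08-01'),  -- id 2
    ('Bob', 'Lee', 'bob@example.com', '2017-09-01'),  -- id 3
    ('Ann', 'Ray', 'ann@example.com', '2017-10-01');  -- id 4

CREATE FUNCTION squash_duplicates_demo(threshold integer) RETURNS bigint
LANGUAGE sql AS $$
    WITH dups AS (
        -- Compare every pair of active rows once (o = older, n = newer)
        -- and keep the pairs whose score reaches the threshold.
        SELECT o.id AS older_id, n.id AS newer_id
        FROM people_demo o
        JOIN people_demo n
          ON (o.created_at, o.id) < (n.created_at, n.id)
        WHERE o.is_active AND n.is_active
          AND CASE WHEN o.first_name = n.first_name THEN 1 ELSE 0 END
            + CASE WHEN o.last_name  = n.last_name  THEN 2 ELSE 0 END
            + CASE WHEN o.email      = n.email      THEN 3 ELSE 0 END
              >= threshold
    ),
    changes AS (
        -- Collapse to one row per affected id, so that each record is
        -- updated only once even if it is older in one pair and newer in another.
        SELECT id,
               bool_or(is_older) AS deactivate,
               array_agg(other_id ORDER BY other_id)
                   FILTER (WHERE NOT is_older) AS append_ids
        FROM (
            SELECT older_id AS id, true AS is_older, newer_id AS other_id FROM dups
            UNION ALL
            SELECT newer_id, false, older_id FROM dups
        ) AS sides
        GROUP BY id
    ),
    updated AS (
        UPDATE people_demo p
        SET is_active    = p.is_active AND NOT c.deactivate,
            squashed_ids = p.squashed_ids || COALESCE(c.append_ids, '{}')
        FROM changes c
        WHERE p.id = c.id
        RETURNING c.deactivate
    )
    -- Return how many records were deactivated.
    SELECT count(*) FILTER (WHERE deactivate) FROM updated;
$$;

SELECT squash_duplicates_demo(4);
SELECT id, is_active, squashed_ids FROM people_demo ORDER BY id;
```

With the sample rows and a threshold of 4, rows 1 and 2 are deactivated, row 2 keeps {1}, and row 4 stays active with {1,2}. Note that the unrestricted self-join compares every pair of rows, so on tens of millions of records the candidate pairs would probably have to be narrowed first, for example to rows that share a value in an indexed column.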