Question

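I need to implement a query (or possibly a stored procedure) that performs soft de-duplication of the data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update the other.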

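Similarity is based on a score. The score is calculated as follows:

take the values of column A from both records,
values are equal? add A1 to the score,
values are not equal? subtract A2 from the score,
move on to the next column.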

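Once all the required pairs of values have been checked:

is the score greater than X?
yes - the records are duplicates: mark the older record as "duplicate" and append its id to the {{1}} column of the newer record.
no - do nothing.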

How can I solve this task in SQL?
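For illustration only, the scoring rules above might be sketched in PostgreSQL along the following lines. Everything concrete here is an assumption rather than something given in the question: the table person_demo, its columns (first_name, last_name, email, is_duplicate, merged_ids), the helper table dup_pairs, the weights (+3/-1, +3/-1, +5/-2), the threshold 8, and the use of the lower id as the "older" record.

```sql
-- Hypothetical demo table: the real table and column names are not given
-- in the question, so every identifier here is an assumption.
CREATE TEMP TABLE person_demo (
    id           serial PRIMARY KEY,
    first_name   text,
    last_name    text,
    email        text,
    is_duplicate boolean   NOT NULL DEFAULT false,
    merged_ids   integer[] NOT NULL DEFAULT '{}'
);

INSERT INTO person_demo (first_name, last_name, email) VALUES
    ('Ann', 'Lee', 'ann@example.com'),
    ('Ann', 'Lee', 'ann@example.com'),
    ('Bob', 'Lee', 'bob@example.com');

-- Score each pair once; the lower id is treated as the older record.
-- Made-up weights: +3/-1 for each name column, +5/-2 for email.
-- A NULL on either side makes "=" yield NULL, so it counts as a mismatch.
CREATE TEMP TABLE dup_pairs AS
SELECT older_id, newer_id, score
FROM (
    SELECT o.id AS older_id,
           n.id AS newer_id,
           CASE WHEN o.first_name = n.first_name THEN 3 ELSE -1 END
         + CASE WHEN o.last_name  = n.last_name  THEN 3 ELSE -1 END
         + CASE WHEN o.email      = n.email      THEN 5 ELSE -2 END AS score
    FROM person_demo o
    JOIN person_demo n ON o.id < n.id
    WHERE NOT o.is_duplicate
      AND NOT n.is_duplicate
) scored
WHERE score > 8;   -- made-up threshold X

-- Mark the older record of every matching pair.
UPDATE person_demo
SET is_duplicate = true
WHERE id IN (SELECT older_id FROM dup_pairs);

-- Append the ids of the older records to the newer record.
UPDATE person_demo p
SET merged_ids = p.merged_ids || d.older_ids
FROM (
    SELECT newer_id, array_agg(older_id ORDER BY older_id) AS older_ids
    FROM dup_pairs
    GROUP BY newer_id
) d
WHERE p.id = d.newer_id;

SELECT id, is_duplicate, merged_ids FROM person_demo ORDER BY id;
```

With the three sample rows, only the pair (1, 2) passes the threshold (score 11), so row 1 ends up flagged as a duplicate and row 2 gets merged_ids = {1}. Note that the self-join compares every pair of rows, so its cost grows quadratically. At tens of millions of records the join would probably need to be restricted to pairs that already share at least one value (for example by adding an equality condition on an indexed column to the ON clause).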

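The table in question is called duplicate_ids. People records are entered by different admins. The de-duplication process exists to make sure that no two identical people exist in the system.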

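The motivation for the task is simple: performance.

Right now the solution is implemented in a scripting language, using several sub-par SQL queries with logic layered on top of them. However, the data volume is expected to grow to tens of millions of records, and the script will eventually become very slow (it is supposed to run every night via cron).

I am using PostgreSQL.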

Answer 1

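De-duplication in general is a thorny problem.

I found this: https://github.com/dedupeio/dedupe. There is a good description of how it works here: https://dedupe.io/documentation/how-it-works.html.

I am going to explore dedupe. I am not going to try to implement this in SQL.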

Answer 2

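If I understand you correctly, this could help.

You can use PostgreSQL window functions to find all the duplicates and use "weights" to determine which records are duplicated, so that you can then do whatever you like with them.

Here is an example: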

-- Temporary table for the test; id identifies each row and
-- we have A, B, C columns with a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text,creation_date date);

-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');

-- SELECT * FROM test
-- id  | colA  | colB  | colC  | creation_date
-- ----+-------+-------+-------+---------------
-- 1   | A     | B     | C     | 2017-05-01
-- 2   | D     | E     | F     | 2017-06-01
-- 3   | A     | B     | D     | 2017-08-01   <-- Duplicate A,B
-- 4   | A     | B     | R     | 2017-09-01   <-- Duplicate A,B
-- 5   | C     | J     | K     | 2017-09-01
-- 6   | A     | C     | J     | 2017-10-01
-- 7   | C     | W     | K     | 2017-10-01   <-- Duplicate C,K
-- 8   | R     | T     | Y     | 2017-11-01

-- Here is the query you can use to get the ids of the duplicate records
-- (the comments are in reverse order; read them from the innermost subquery outwards):

-- third, you select the id of the duplicates
SELECT id
FROM
    (
-- Second, select all the columns needed and weight the duplicates.
-- You don't need to select every column, if only the id is needed
-- then you can only select the id
-- Query this SQL to see results:
     SELECT 
     id,"colA", "colB", "colC",creation_date,
-- The weights are simple, if the row count is more than 1 then assign 1,
-- if the row count is 1 then assign 0, sum all and you have a
-- total weight of 'duplicity'.
     CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
     CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
     CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
    FROM
        (
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
        SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
            row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
            row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
            row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
        FROM test ORDER BY id
        ) count_column_duplicates
    ) duplicates
-- This is where the weight threshold is chosen; for the test,
-- I selected the rows whose weight is greater than 1
WHERE weight>1;

-- The complete query returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7

You can put this query in a stored procedure so that you can run it whenever you like. Hope it helps.
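As a rough sketch of that last suggestion, the query could be wrapped in a PL/pgSQL function like the one below. The names find_duplicate_ids and min_weight are hypothetical, and the sketch assumes the temporary table test from the example above already exists in the session. A set-returning function is used here instead of a procedure because the goal is to return the ids.

```sql
-- Assumes the temporary table "test" from the example above already exists.
-- find_duplicate_ids and min_weight are hypothetical names.
CREATE OR REPLACE FUNCTION find_duplicate_ids(min_weight integer)
RETURNS SETOF integer
LANGUAGE plpgsql
AS $$
BEGIN
    RETURN QUERY
    SELECT w.id
    FROM (
        -- Sum one point per column in which the row repeats an earlier value.
        SELECT c.id,
               CASE WHEN c.num_a > 1 THEN 1 ELSE 0 END
             + CASE WHEN c.num_b > 1 THEN 1 ELSE 0 END
             + CASE WHEN c.num_c > 1 THEN 1 ELSE 0 END AS weight
        FROM (
            -- Number the rows within each group of equal column values.
            SELECT t.id,
                   row_number() OVER (PARTITION BY t."colA" ORDER BY t.id) AS num_a,
                   row_number() OVER (PARTITION BY t."colB" ORDER BY t.id) AS num_b,
                   row_number() OVER (PARTITION BY t."colC" ORDER BY t.id) AS num_c
            FROM test t
        ) c
    ) w
    WHERE w.weight > min_weight
    ORDER BY w.id;
END;
$$;

-- Same threshold as in the example: weight greater than 1.
SELECT * FROM find_duplicate_ids(1);
```

Called with 1, this should return the same ids as the query above (3, 4 and 7). Passing the threshold as a parameter makes it easy to try stricter or looser settings without editing the query.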

Finding & updating duplicate rows
