Finding & updating duplicate rows

Date: 2018-01-16 15:19:12

Tags: sql algorithm postgresql duplicates

I need to implement a query (or a stored procedure) that performs a soft deduplication of the data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update the other.

Similarity is based on a score. The score is computed as follows:

  1. take the values of column A from both records,
  2. values equal? add A1 to the score,
  3. values not equal? subtract A2 from the score,
  4. move on to the next column,
  5. once all the required value pairs have been checked:

    1. is the score above X?
    2. yes - the records are duplicates; mark one record as "duplicate" and append its id to the newer record's `duplicate_ids` column.
    3. no - do nothing.

How can I solve this task in SQL?
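The scoring rules above map naturally onto a self-join with one summed CASE per compared column. Here is a minimal sketch of that idea. It runs against in-memory SQLite purely so the example is self-contained; the SQL itself is plain ANSI and works unchanged in PostgreSQL. The table name, columns, and the weights/threshold (A1 = 2, A2 = 1, X = 2) are all made up for illustration:

```python
import sqlite3

# Hypothetical person table; real column names come from your schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT, city TEXT);
INSERT INTO person VALUES
  (1, 'John Smith', 'john@example.com', 'Boston'),
  (2, 'John Smith', 'john@example.com', 'Austin'),
  (3, 'Jane Doe',   'jane@example.com', 'Boston');
""")

# A matching column adds 2 (A1), a mismatch subtracts 1 (A2); the
# threshold X is 2. The p1.id < p2.id condition scores each pair once.
pairs = con.execute("""
    SELECT older_id, newer_id, score FROM (
        SELECT p1.id AS older_id, p2.id AS newer_id,
               CASE WHEN p1.name  = p2.name  THEN 2 ELSE -1 END +
               CASE WHEN p1.email = p2.email THEN 2 ELSE -1 END +
               CASE WHEN p1.city  = p2.city  THEN 2 ELSE -1 END AS score
        FROM person p1
        JOIN person p2 ON p1.id < p2.id
    ) scored
    WHERE score > 2
    ORDER BY older_id
""").fetchall()
# pairs -> [(1, 2, 3)]: records 1 and 2 agree on name and email (+2 +2)
# and differ on city (-1), so their score of 3 clears the threshold.
```

Note that a naive self-join produces O(n²) candidate pairs; at tens of millions of records you would want to restrict candidates first, e.g. by joining only on a shared blocking key such as an equal name or email.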

The table in question holds person records, entered by different administrators. The deduplication process exists to ensure that no two identical people exist in the system.

The motivation for the task is simple: performance.

Right now the solution is implemented in a scripting language, through several sub-par SQL queries plus application logic. However, the data volume is expected to grow to tens of millions of records, and the script will eventually become very slow (it is supposed to run every night via cron).

I'm using PostgreSQL.

2 answers:

Answer 0 (score: 1)

Deduplication is generally a tricky problem.

I found this: https://github.com/dedupeio/dedupe. There is a good description of how it works here: https://dedupe.io/documentation/how-it-works.html

I would explore dedupe. I wouldn't try to implement this in SQL.

Answer 1 (score: 0)

If I understood you correctly, this might help.

You can use PostgreSQL window functions to find all the duplicates and assign a "weight" that determines which records count as duplicates, so you can then do whatever you need with them.

Here is an example:

-- Temporary table for the test; the primary key is id and
-- we have colA, colB, colC columns plus a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text, creation_date date);

-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');

-- SELECT * FROM test
-- id  | colA  | colB  | colC  | creation_date
-- ----+-------+-------+-------+---------------
-- 1   | A     | B     | C     | 2017-05-01
-- 2   | D     | E     | F     | 2017-06-01
-- 3   | A     | B     | D     | 2017-08-01   <-- Duplicate A,B
-- 4   | A     | B     | R     | 2017-09-01   <-- Duplicate A,B
-- 5   | C     | J     | K     | 2017-09-01
-- 6   | A     | C     | J     | 2017-10-01
-- 7   | C     | W     | K     | 2017-10-01   <-- Duplicate C,K
-- 8   | R     | T     | Y     | 2017-11-01

-- Here is the query you can use to get the ids of the duplicate records
-- (read the numbered comments from the innermost query outward):

-- Third, select the ids of the duplicates.
SELECT id
FROM
    (
-- Second, select all the columns needed and weight the duplicates.
-- You don't have to select every column; if only the id is needed,
-- selecting just the id is enough.
-- Run this subquery on its own to see intermediate results:
     SELECT 
     id,"colA", "colB", "colC",creation_date,
-- The weighting is simple: if the row number within a partition is
-- greater than 1 (the value already appeared in an earlier row),
-- assign 1, otherwise 0; summing gives a total weight of 'duplicity'.
     CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
     CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
     CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
    FROM
        (
-- First, select using window functions and assign a row number.
-- You can run this query separately to see results
        SELECT *,
-- NOTE that it is order by id, if needed you can order by creation_date instead
            row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
            row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
            row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
        FROM test ORDER BY id
        ) count_column_duplicates
    ) duplicates
-- HERE IS DEFINED WHICH WEIGHT COUNTS AS DUPLICATE; for this test
-- I chose rows with a weight greater than 1.
WHERE weight > 1;

-- The full query returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7

You can put this query in a stored procedure so you can run it whenever you want. Hope it helps.
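Once the duplicate ids are known, the "squash" the question describes (deactivate one record, append its id to the survivor) comes down to a pair of UPDATEs. A minimal sketch of that step, again using in-memory SQLite only to keep the example runnable; the `active` flag and the text-typed `duplicate_ids` column are assumptions made here for illustration (in PostgreSQL you could instead use an integer array column and `array_append`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    name TEXT,
    active INTEGER DEFAULT 1,      -- 1 = live record, 0 = squashed duplicate
    duplicate_ids TEXT DEFAULT ''  -- comma-separated ids; int[] in PostgreSQL
);
INSERT INTO person (id, name) VALUES (1, 'John Smith'), (2, 'John Smith');
""")

def squash(con, older_id, newer_id):
    """Deactivate the older record and append its id to the newer one."""
    con.execute("UPDATE person SET active = 0 WHERE id = ?", (older_id,))
    con.execute(
        "UPDATE person SET duplicate_ids = duplicate_ids || ? || ',' WHERE id = ?",
        (str(older_id), newer_id),
    )
    con.commit()

squash(con, older_id=1, newer_id=2)
rows = con.execute(
    "SELECT id, active, duplicate_ids FROM person ORDER BY id"
).fetchall()
# rows -> [(1, 0, ''), (2, 1, '1,')]
```

Running both updates inside a single transaction (as the `commit` boundary above suggests) keeps the pair consistent if the nightly job is interrupted mid-run.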