包含排除项的多遍复制识别

时间:2019-10-11 22:09:28

标签: sql sql-server duplicates

我有一个包含数十万条记录的客户表。有很多不同程度的重复项。我正在尝试确定重复记录的可能性,而重复记录的可能性为水平。

我的源表有7个字段,看起来像这样: Sample Source Table

我寻找重复项,并将它们放入中间表中,并列出可能性,表名和客户编号。 中间表

CREATE TABLE DataCheck (
    id          int             identity(1,1),
    reason      varchar(100)    DEFAULT NULL,
    tableName   varchar(100)    DEFAULT NULL,
    tableID     varchar(100)    DEFAULT NULL
)

这是我用于识别和插入的代码:

-- Match on Company, Contact, Address, City, and Phone
-- DUPE
INSERT INTO DataCheck 
    SELECT 'Duplicate','CUSTOMER',tcd.uid
      FROM #tmpCoreData tcd
        INNER JOIN 
        (SELECT
            company, 
            fname,
            lname,
            add1,
            city,
            phone1,
            COUNT(*) AS count
          FROM #tmpCoreData
          WHERE company <> ''
          GROUP BY company, fname, lname, add1, city, phone1
          HAVING COUNT(*) > 1) dl 
        ON dl.company = tcd.company
    ORDER BY tcd.company

在此示例中,它将插入ID 101、102

问题是当我执行下一遍时:

-- Match on Company, Address, City, Phone (Diff Contacts)
-- LIKELY DUPE
INSERT INTO DataCheck 
    SELECT 'Likely Duplicate','CUSTOMER',tcd.uid
      FROM #tmpCoreData tcd
        INNER JOIN 
        (SELECT
            company, 
            add1,
            city,
            phone1,
            COUNT(*) AS count
          FROM #tmpCoreData
          WHERE company <> ''
          GROUP BY company, add1, city, phone1
          HAVING COUNT(*) > 1) dl 
        ON dl.company = tcd.company
    ORDER BY tcd.companyc

然后此通行证将插入101、102和103。 下一遍将手机掉落,因此它将插入101、102、103、104 下一遍将只查找会插入所有5的公司。

我现在在中间表中有14个条目,可以记录5条记录。

如何添加排除项,以便第二遍密码在相同的公司,地址,城市,电话上分组,但fname和lname不同。然后只能插入101和103

我考虑添加一个NOT IN(从DataCheck中选择SELECT tableID)以确保不会多次添加ID,但是在第4遍的第3遍,它可能会找到一个重复项,并在该行的重复项后输入700条记录,因此您会失去它的欺骗性。

我的输出使用:

SELECT 
    dc.reason,
    dc.tableName,
    tcd.*
  FROM DataCheck dc
  INNER JOIN #tmpCoreData tcd
    ON tcd.uid = dc.tableID
ORDER BY dc.id

看起来像这样,这有点令人困惑: Output Table

1 个答案:

答案 0 :(得分:2)

我将挑战您对问题的看法,而是建议您计算一个简单的“置信度”,这也将帮助您大大简化结果表:

WITH FirstCompany AS (SELECT custNo, company, fname, lname, add1, city, phone1
                      FROM(SELECT custNo, company, fname, lname, add1, city, phone1, 
                                  ROW_NUMBER() OVER(PARTITION BY company ORDER BY custNo) AS ordering
                           FROM CoreData) FC
                      WHERE ordering = 1)

SELECT RankMapping.description, Duplicate.custNo, Duplicate.company, Duplicate.fname, Duplicate.lname, Duplicate.add1, Duplicate.city, Duplicate.phone1
FROM (SELECT FirstCompany.custNo AS originalCustNo, Duplicate.*, 
             CASE WHEN FirstCompany.custNo = Duplicate.custNo THEN 1 ELSE 0 END 
             + CASE WHEN FirstCompany.fname = Duplicate.fname AND FirstCompany.lname = Duplicate.lname THEN 1 ELSE 0 END 
             + CASE WHEN FirstCompany.add1 = Duplicate.add1  AND FirstCompany.city = Duplicate.city THEN 1 ELSE 0 END
             + CASE WHEN FirstCompany.phone1 = Duplicate.phone1 THEN 1 ELSE 0 END
             AS ranking
      FROM FirstCompany
      JOIN CoreData Duplicate
        ON Duplicate.custNo >= FirstCompany.custNo
           AND Duplicate.company = FirstCompany.company) Duplicate
JOIN (VALUES (4, 'original'),
             (3, 'duplicate'),
             (2, 'likely dupe'),
             (1, 'possible dupe'),
             (0, 'not likely dupe')) RankMapping(score, description)
 ON RankMapping.score = Duplicate.ranking
ORDER BY Duplicate.originalCustNo, Duplicate.ranking DESC

SQL Fiddle Example

...生成如下所示的结果:

| description     | custNo | company  | fname   | lname  | add1         | city         | phone1     |
|-----------------|--------|----------|---------|--------|--------------|--------------|------------|
| original        | 101    | ACME INC | JOHN    | DOE    | 123 ACME ST  | LOONEY HILLS | 1231234567 |
| duplicate       | 102    | ACME INC | JOHN    | DOE    | 123 ACME ST  | LOONEY HILLS | 1231234567 |
| likely dupe     | 103    | ACME INC | JANE    | SMITH  | 123 ACME ST  | LOONEY HILLS | 1231234567 |
| possible dupe   | 104    | ACME INC | BOB     | DOLE   | 123 ACME ST  | LOONEY HILLS | 4564567890 |
| not likely dupe | 105    | ACME INC | JESSICA | RABBIT | 456 ROGER LN | WARNER       | 4564567890 |

此代码无根据地假定最小的custNo是“原始”,并假定匹配项仅等于那个匹配项,但是也完全有可能获得其他匹配项(只是使CTE中的子查询嵌套) ,然后删除行号。