我有一个包含数十万条记录的客户表。有很多不同程度的重复项。我正在尝试确定重复记录的可能性,而重复记录的可能性为水平。
我寻找重复项,并将它们放入中间表中,并列出可能性,表名和客户编号。 中间表
CREATE TABLE DataCheck (
id int identity(1,1),
reason varchar(100) DEFAULT NULL,
tableName varchar(100) DEFAULT NULL,
tableID varchar(100) DEFAULT NULL
)
这是我用于识别和插入的代码:
-- Match on Company, Contact, Address, City, and Phone
-- DUPE
INSERT INTO DataCheck
SELECT 'Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
fname,
lname,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, fname, lname, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.company
在此示例中,它将插入ID 101、102
问题是当我执行下一遍时:
-- Match on Company, Address, City, Phone (Diff Contacts)
-- LIKELY DUPE
INSERT INTO DataCheck
SELECT 'Likely Duplicate','CUSTOMER',tcd.uid
FROM #tmpCoreData tcd
INNER JOIN
(SELECT
company,
add1,
city,
phone1,
COUNT(*) AS count
FROM #tmpCoreData
WHERE company <> ''
GROUP BY company, add1, city, phone1
HAVING COUNT(*) > 1) dl
ON dl.company = tcd.company
ORDER BY tcd.companyc
然后此通行证将插入101、102和103。 下一遍将手机掉落,因此它将插入101、102、103、104 下一遍将只查找会插入所有5的公司。
我现在在中间表中有14个条目,可以记录5条记录。
如何添加排除项,以便第二遍密码在相同的公司,地址,城市,电话上分组,但fname和lname不同。然后只能插入101和103
我考虑添加一个NOT IN(从DataCheck中选择SELECT tableID)以确保不会多次添加ID,但是在第4遍的第3遍,它可能会找到一个重复项,并在该行的重复项后输入700条记录,因此您会失去它的欺骗性。
我的输出使用:
SELECT
dc.reason,
dc.tableName,
tcd.*
FROM DataCheck dc
INNER JOIN #tmpCoreData tcd
ON tcd.uid = dc.tableID
ORDER BY dc.id
答案 0 :(得分:2)
我将挑战您对问题的看法,而是建议您计算一个简单的“置信度”,这也将帮助您大大简化结果表:
WITH FirstCompany AS (SELECT custNo, company, fname, lname, add1, city, phone1
FROM(SELECT custNo, company, fname, lname, add1, city, phone1,
ROW_NUMBER() OVER(PARTITION BY company ORDER BY custNo) AS ordering
FROM CoreData) FC
WHERE ordering = 1)
SELECT RankMapping.description, Duplicate.custNo, Duplicate.company, Duplicate.fname, Duplicate.lname, Duplicate.add1, Duplicate.city, Duplicate.phone1
FROM (SELECT FirstCompany.custNo AS originalCustNo, Duplicate.*,
CASE WHEN FirstCompany.custNo = Duplicate.custNo THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.fname = Duplicate.fname AND FirstCompany.lname = Duplicate.lname THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.add1 = Duplicate.add1 AND FirstCompany.city = Duplicate.city THEN 1 ELSE 0 END
+ CASE WHEN FirstCompany.phone1 = Duplicate.phone1 THEN 1 ELSE 0 END
AS ranking
FROM FirstCompany
JOIN CoreData Duplicate
ON Duplicate.custNo >= FirstCompany.custNo
AND Duplicate.company = FirstCompany.company) Duplicate
JOIN (VALUES (4, 'original'),
(3, 'duplicate'),
(2, 'likely dupe'),
(1, 'possible dupe'),
(0, 'not likely dupe')) RankMapping(score, description)
ON RankMapping.score = Duplicate.ranking
ORDER BY Duplicate.originalCustNo, Duplicate.ranking DESC
...生成如下所示的结果:
| description | custNo | company | fname | lname | add1 | city | phone1 |
|-----------------|--------|----------|---------|--------|--------------|--------------|------------|
| original | 101 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| duplicate | 102 | ACME INC | JOHN | DOE | 123 ACME ST | LOONEY HILLS | 1231234567 |
| likely dupe | 103 | ACME INC | JANE | SMITH | 123 ACME ST | LOONEY HILLS | 1231234567 |
| possible dupe | 104 | ACME INC | BOB | DOLE | 123 ACME ST | LOONEY HILLS | 4564567890 |
| not likely dupe | 105 | ACME INC | JESSICA | RABBIT | 456 ROGER LN | WARNER | 4564567890 |
此代码无根据地假定最小的custNo
是“原始”,并假定匹配项仅等于那个匹配项,但是也完全有可能获得其他匹配项(只是使CTE中的子查询嵌套) ,然后删除行号。