SQL: merging and cleaning up duplicate data

Date: 2015-01-05 11:32:42

Tags: sql sql-server tsql

I have a database with three tables:

---- Contracts --------------------------------------
[PK]ContractID
[FK]DebtorID
    Other (like DateStart, DateEnd, ContractStatus, etc.)
-----------------------------------------------------

---- Debtors ----------------------------------------
[PK]DebtorID
[FK]ContactID
    DebNr
-----------------------------------------------------

---- Contacts ---------------------------------------
[PK]ContactID
    ContactType (0 = person, 1 = company)
    ContactNote
    Name
-----------------------------------------------------

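It is a fairly simple design. I took an old database and migrated its data into this new structure. The only problem is that the data needs cleaning up: debtors with the same name appear more than once, for example: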

ContractID: 1 DebtorID: 1 ContactID: 1 DebtorName: Philips
ContractID: 8 DebtorID: 3 ContactID: 9 DebtorName: Philips

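Clearly these two debtors are the same, so I used fuzzy grouping with SSIS and T-SQL and updated the matching debtors to share the same ContactID. The sample data now looks like this: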

ContractID: 1 DebtorID: 1 ContactID: 1 DebtorName: Philips
ContractID: 8 DebtorID: 3 ContactID: 1 DebtorName: Philips

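Both rows now point to a single 'Philips' contact, but there are still two DebtorIDs referencing that same ContactID, which is not ideal. Next I want to update the Contracts table so that both contracts reference the same DebtorID, which lets me delete the duplicate debtors. So what I basically want to end up with is: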

ContractID: 1 DebtorID: 1 ContactID: 1 DebtorName: Philips
ContractID: 8 DebtorID: 1 ContactID: 1 DebtorName: Philips

I wrote a T-SQL script to achieve this:

DECLARE @MINID INT
DECLARE @MAXID INT
DECLARE @DEBTORID INT
DECLARE @CONTACTID INT

/* ENTER ALL THE CONTACTID's INTO #TEMP1 WHICH OCCUR MORE THAN ONCE AND ADD A ROWNUMBER TO IT SO WE CAN GO DOWN THE LIST 1 BY 1*/
SELECT Row_number()
         OVER (
           ORDER BY ContactID) AS RNUM,
           ContactID AS ContactID,
           COUNT(*) AS AmountDuplicates
INTO   #TEMP1
FROM Debtors
GROUP BY ContactID
HAVING COUNT(*) > 1

SET @MINID = (SELECT(MIN(RNUM)) FROM #TEMP1)
SET @MAXID = (SELECT(MAX(RNUM)) FROM #TEMP1)

WHILE @MINID <= @MAXID
BEGIN
    /* SELECT THE CONTACTID OF THE ITERATION */
    SELECT @CONTACTID = ContactID
    FROM #TEMP1
    WHERE RNUM = @MINID

    /* SELECT THE LOWEST DEBTORID WHERE THE CONTACTID OCCURS MORE THAN ONCE */
    SELECT TOP(1) @DEBTORID = DebtorID  
    FROM Debtors 
    WHERE ContactID = @CONTACTID
    ORDER BY DebtorID

    /* UPDATE ALL CONTRACTS WITH THIS LOWEST DEBTORID WHERE THE CONTACTID OCCURS MORE THAN ONCE */
    UPDATE C
    SET DebtorID = @DEBTORID
    FROM Contracts C
        INNER JOIN Debtors D ON C.DebtorID = D.DebtorID
    WHERE D.ContactID = @CONTACTID

    /* NEXT RNUM FROM #TEMP1 ITERATION */
    SET @MINID = @MINID + 1
END
DROP TABLE #TEMP1

Finally, I delete the debtors that are no longer referenced by the Contracts table.

DELETE
FROM Debtors
WHERE DebtorID NOT IN (SELECT DebtorID FROM Contracts)

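I can confirm that this does the job, but out of curiosity: is there a simpler way to achieve the same result, with fewer operations and fewer detours? I am working in MS SQL Server 2008 R2.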
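For comparison, the loop and the final delete can probably be collapsed into two set-based statements. What follows is only an untested sketch against the three tables described above. The CTE name Canonical and the column alias KeepDebtorID are made up for illustration, and it assumes that debtors with a NULL ContactID should be left alone, which is what the loop effectively does. MIN() OVER (PARTITION BY ...) is available in SQL Server 2008 R2.

```sql
BEGIN TRANSACTION;

-- Map every debtor to the lowest DebtorID that shares its ContactID.
-- Debtors with a NULL ContactID are excluded so they are never merged together.
WITH Canonical AS (
    SELECT DebtorID,
           MIN(DebtorID) OVER (PARTITION BY ContactID) AS KeepDebtorID
    FROM   Debtors
    WHERE  ContactID IS NOT NULL
)
-- Repoint only the contracts that currently reference a non-canonical debtor.
UPDATE C
SET    C.DebtorID = K.KeepDebtorID
FROM   Contracts AS C
       INNER JOIN Canonical AS K ON K.DebtorID = C.DebtorID
WHERE  C.DebtorID <> K.KeepDebtorID;

-- Remove debtors that no contract references any more.
-- NOT EXISTS is used instead of NOT IN because NOT IN deletes nothing
-- if Contracts.DebtorID ever contains a NULL.
DELETE D
FROM   Debtors AS D
WHERE  NOT EXISTS (SELECT 1
                   FROM   Contracts AS C
                   WHERE  C.DebtorID = D.DebtorID);

-- Check the affected row counts first; use ROLLBACK TRANSACTION instead if they look wrong.
COMMIT TRANSACTION;
```

Like the original DELETE, the second statement also removes debtors that never had a contract in the first place, so it may be worth checking whether that is intended.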

0 answers:

No answers yet