SQL: eliminate duplicates while merging rows from another table

Time: 2019-03-06 16:17:27

Tags: sql duplicates merging-data

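I have two tables, ADDRESSES and CONTACTS. Each contact has a SUPERID, which is the ID of the address it belongs to. I want to identify duplicates in the ADDRESSES table (same last name, first name, and birthday) and merge the contacts of those duplicates into the most recent address (latest DATECREATE or highest address ID). After that, the other duplicates should be deleted.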

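My approach to merging the contacts does not work. Deleting the duplicates works. Here is my approach. I would appreciate help figuring out what is wrong here. Thanks!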

    UPDATE dbo.CONTACTS
    SET SUPERID = ADDRESSES.ID
    FROM dbo.ADDRESSES
    INNER JOIN CONTACTS ON ADDRESSES.ID = CONTACTS.SUPERID
    WHERE ADDRESSES.id IN (
        SELECT id FROM dbo.ADDRESSES
        WHERE EXISTS (
            SELECT NULL FROM ADDRESSES AS tmpcomment
            WHERE dbo.ADDRESSES.FIRSTNAME0 = tmpcomment.FIRSTNAME0
              AND dbo.ADDRESSES.LASTNAME0 = tmpcomment.LASTNAME0
              AND dbo.ADDRESSES.BIRTHDAY1 = tmpcomment.BIRTHDAY1
            HAVING dbo.ADDRESSES.id > MIN(tmpcomment.id)
        )
    )

    DELETE FROM ADDRESSES
    WHERE id IN (
        SELECT id FROM dbo.ADDRESSES
        WHERE EXISTS (
            SELECT NULL FROM ADDRESSES AS tmpcomment
            WHERE dbo.ADDRESSES.FIRSTNAME0 = tmpcomment.FIRSTNAME0
              AND dbo.ADDRESSES.LASTNAME0 = tmpcomment.LASTNAME0
              AND dbo.ADDRESSES.BIRTHDAY1 = tmpcomment.BIRTHDAY1
            HAVING dbo.ADDRESSES.id > MIN(tmpcomment.id)
        )
    )
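
One likely reason the UPDATE has no visible effect: the join condition `ADDRESSES.ID = CONTACTS.SUPERID` means every matched contact is assigned the address ID it already points to. The sketch below reproduces that in Python with the standard-library `sqlite3` module rather than SQL Server, so the statement is an assumed-equivalent rewrite using a correlated subquery (not the original T-SQL), and the rows are the sample data given in this question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE addresses (id INTEGER, lastname0 TEXT, firstname0 TEXT, birthday1 TEXT);
CREATE TABLE contacts (id INTEGER, superid INTEGER);
INSERT INTO addresses VALUES
  (1, 'Arthur', 'James', '1980-05-05'), (2, 'Arthur', 'James', '1980-05-05'),
  (3, 'Arthur', 'James', '1980-05-05'), (4, 'Arthur', 'James', '1980-05-05'),
  (6, 'Doyle', 'Peter', '1950-01-01'), (7, 'Doyle', 'Peter', '1950-01-01');
INSERT INTO contacts VALUES
  (1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4), (7, 4),
  (8, 6), (9, 6), (10, 6), (11, 7);
""")

before = conn.execute("SELECT id, superid FROM contacts ORDER BY id").fetchall()

# Same logic as the UPDATE in the question: the new value is the ID of the
# address joined on addresses.id = contacts.superid, i.e. the current value.
conn.execute("""
UPDATE contacts
SET superid = (SELECT a.id FROM addresses AS a WHERE a.id = contacts.superid)
WHERE superid IN (
    SELECT a.id FROM addresses AS a
    WHERE a.id > (SELECT MIN(t.id) FROM addresses AS t
                  WHERE t.firstname0 = a.firstname0
                    AND t.lastname0 = a.lastname0
                    AND t.birthday1 = a.birthday1)
)
""")

after = conn.execute("SELECT id, superid FROM contacts ORDER BY id").fetchall()
unchanged = (before == after)
print("contacts unchanged:", unchanged)
```

Note also that the filter `id > MIN(...)` selects every row except the lowest ID in each group, whereas the expected result keeps the highest ID.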

Here is a sample to illustrate the problem.

ADDRESSES

|    ID      | DATECREATE  |   LASTNAME0  | FIRSTNAME0  |    BIRTHDAY1 |
|:-----------|------------:|:------------:|------------:|:------------:|
| 1          |  19.07.2011 |     Arthur   |   James     |  05.05.1980  |
| 2          |  23.08.2012 |     Arthur   |   James     |  05.05.1980  |
| 3          |  11.12.2015 |     Arthur   |   James     |  05.05.1980  |
| 4          |  22.10.2016 |     Arthur   |   James     |  05.05.1980  |
| 6          |  20.12.2014 |     Doyle    |   Peter     |  01.01.1950  |
| 7          |  09.01.2016 |     Doyle    |   Peter     |  01.01.1950  |

CONTACTS

|    ID      | SUPERID  |
|:-----------|---------:|
|    1       |    1     |
|    2       |    1     |
|    3       |    2     |
|    4       |    2     |
|    5       |    3     |
|    6       |    4     |
|    7       |    4     |
|    8       |    6     |
|    9       |    6     |
|    10      |    6     |
|    11      |    7     |

The result should look like this:

ADDRESSES

|    ID      | DATECREATE  |   LASTNAME0  | FIRSTNAME0  |    BIRTHDAY1 |
|:-----------|------------:|:------------:|------------:|:------------:|
| 4          |  22.10.2016 |     Arthur   |   James     |  05.05.1980  |
| 7          |  09.01.2016 |     Doyle    |   Peter     |  01.01.1950  |

CONTACTS

|    ID      | SUPERID  |
|:-----------|---------:|
|    1       |    4     |
|    2       |    4     |
|    3       |    4     |
|    4       |    4     |
|    5       |    4     |
|    6       |    4     |
|    7       |    4     |
|    8       |    7     |
|    9       |    7     |
|    10      |    7     |
|    11      |    7     |

1 answer:

Answer 0 (score: 0)

My approach is to use a temporary table:

/*


CREATE TABLE addresses
([ID] int, [DATECREATE] varchar(10), [LASTNAME0] varchar(6), [FIRSTNAME0] varchar(5), [BIRTHDAY1] datetime);

INSERT INTO addresses
([ID], [DATECREATE], [LASTNAME0], [FIRSTNAME0], [BIRTHDAY1])
VALUES
(1, '19.07.2011', 'Arthur', 'James', '1980-05-05 00:00:00'),
(2, '23.08.2012', 'Arthur', 'James', '1980-05-05 00:00:00'),
(3, '11.12.2015', 'Arthur', 'James', '1980-05-05 00:00:00'),
(4, '22.10.2016', 'Arthur', 'James', '1980-05-05 00:00:00'),
(6, '20.12.2014', 'Doyle', 'Peter', '1950-01-01 00:00:00'),
(7, '09.01.2016', 'Doyle', 'Peter', '1950-01-01 00:00:00');


CREATE TABLE contacts
([ID] int, [SUPERID] int);

INSERT INTO contacts
([ID], [SUPERID])
VALUES
(1, 1),
(2, 1),
(3, 2),
(4, 2),
(5, 3),
(6, 4),
(7, 4),
(8, 6),
(9, 6),
(10, 6),
(11, 7);

*/


DROP TABLE IF EXISTS #t; -- SQL Server 2016+ only; on older versions use: IF OBJECT_ID('tempdb..#t') IS NOT NULL DROP TABLE #t;
SELECT id as oldid, MAX(id) OVER(PARTITION BY lastname0, firstname0, birthday1) as newid INTO #t
FROM 
  addresses;

/*now #t contains data like 
1, 4
2, 4
3, 4
4, 4
6, 7
7, 7*/

--remove the ones we don't need to change
DELETE FROM #t WHERE oldid = newid;

BEGIN TRANSACTION;
SELECT * FROM addresses;
SELECT * FROM contacts;

--now #t is the list of contact changes we need to make, so make those changes
UPDATE contacts
SET contacts.superid = #t.newid
FROM
  contacts INNER JOIN #t ON contacts.superid = #t.oldid;

--now scrub the old addresses with no contact records. This catches all such records, not just those in #t
--(NOT IN matches nothing if the subquery returns a NULL, hence the IS NOT NULL filter)
DELETE FROM addresses WHERE id NOT IN (SELECT superid FROM contacts WHERE superid IS NOT NULL);

--alternative: clean up only the records affected by this operation (either DELETE is enough on its own)
DELETE FROM addresses WHERE id IN (SELECT oldid FROM #t);

SELECT * FROM addresses;
SELECT * FROM contacts;
ROLLBACK TRANSACTION;

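Note that I have tested this and it produces the result you want, but I advise caution when copying update/delete queries from the internet and running them. I have wrapped everything in a transaction that selects the data before and after the changes and then rolls back, so nothing gets destroyed. Still, run it on a test database first!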
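
The question also allows picking the surviving address by latest DATECREATE instead of highest ID. A minimal sketch of that variant follows, in Python with the standard-library `sqlite3` module rather than SQL Server, so the syntax is SQLite's and not T-SQL. It assumes DATECREATE is stored as `dd.mm.yyyy` text, as in the sample. The `ranked` and `merge_map` tables, the `created_key` column, and the tie-break on highest ID are illustrative choices, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE addresses (id INTEGER, datecreate TEXT, lastname0 TEXT,
                        firstname0 TEXT, birthday1 TEXT);
CREATE TABLE contacts (id INTEGER, superid INTEGER);
INSERT INTO addresses VALUES
  (1, '19.07.2011', 'Arthur', 'James', '1980-05-05'),
  (2, '23.08.2012', 'Arthur', 'James', '1980-05-05'),
  (3, '11.12.2015', 'Arthur', 'James', '1980-05-05'),
  (4, '22.10.2016', 'Arthur', 'James', '1980-05-05'),
  (6, '20.12.2014', 'Doyle', 'Peter', '1950-01-01'),
  (7, '09.01.2016', 'Doyle', 'Peter', '1950-01-01');
INSERT INTO contacts VALUES
  (1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 4), (7, 4),
  (8, 6), (9, 6), (10, 6), (11, 7);

-- Assumption: datecreate is 'dd.mm.yyyy'; rebuild it as 'yyyymmdd' so that
-- plain text ordering matches chronological ordering.
CREATE TEMP TABLE ranked AS
SELECT id, lastname0, firstname0, birthday1,
       substr(datecreate, 7, 4) || substr(datecreate, 4, 2)
           || substr(datecreate, 1, 2) AS created_key
FROM addresses;

-- For each address, find the newest address in its duplicate group
-- (ties broken by highest id).
CREATE TEMP TABLE merge_map AS
SELECT r.id AS oldid,
       (SELECT k.id FROM ranked AS k
        WHERE k.lastname0 = r.lastname0
          AND k.firstname0 = r.firstname0
          AND k.birthday1 = r.birthday1
        ORDER BY k.created_key DESC, k.id DESC
        LIMIT 1) AS newid
FROM ranked AS r;

-- Rows that already are the survivor need no change.
DELETE FROM merge_map WHERE oldid = newid;

-- Repoint contacts to the surviving address, then drop the merged addresses.
UPDATE contacts
SET superid = (SELECT m.newid FROM merge_map AS m WHERE m.oldid = contacts.superid)
WHERE superid IN (SELECT oldid FROM merge_map);

DELETE FROM addresses WHERE id IN (SELECT oldid FROM merge_map);
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM addresses ORDER BY id")]
contacts_after = conn.execute("SELECT id, superid FROM contacts ORDER BY id").fetchall()
print(remaining)
print(contacts_after)
```

On the sample data the newest address in each group is also the one with the highest ID, so this variant and the `MAX(id)` version should agree. They can differ when an older record was inserted later.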