Assign the same key to duplicate records in SQL

Asked: 2019-05-11 07:06:50

Tags: sql sql-server tsql

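My customer data contains duplicate records, identified by a few key fields: for example, different records that share the same email address, phone, or mailing address. I want to identify the sets of duplicate records based on matching email, phone, or mailing address and assign each set a duplicate number (the same ID) to mark the records as the same customer. Once that is done, I want to move the unique customer records to another table, which will contain no duplicates and will serve as the master record table.

I was able to use dense_rank to rank duplicate records with the same number, but I got stuck after that and don't know how to assign the new key to NewCustID for all records.

Initial tables and sample data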

create table Cust_init(
    NewCustID int,
    DW_CustID int,
    FirstName varchar(50),
    LastName varchar(50),
    Email varchar(50),
    MailAddress varchar(50),
    Phone varchar(50)
)

create table MergedCust(
    NewCustID int,
    DW_CustID int,
    FirstName varchar(50),
    LastName varchar(50),
    Email varchar(50),
    MailAddress varchar(50),
    Phone varchar(50)
)


insert into dbo.cust_init(DW_CustID,FirstName, LastName,Email,MailAddress,Phone) 
values(11,'Ahmad','Raza','ahmaddba@gmail.com','154 Zafarwal, Narowaal','0345 2876543'),
      (12,'Iftikhan','Khan','iffikhan@gmail.com','12 A DHA Phase ','0303 56871298'),
      (13,'Iftikhan','Khan','iffikhan@gmail.com','12 A DHA Phase ','0303 56871298'),
      (14,'Mohsin','Khan','mohsinkaz@gmail.com','55 shadab nagar, Lahore','0301 6791255'),
      (15,'Mohsin','Khan','mohsinkaz@gmail.com','55 shadab nagar, Lahore','0301 6791255'),
      (16,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
      (17,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
      (18,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
      (19,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
      (20,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266');

After the data is inserted, the Cust_init table should look like this:

NewCustID   |DW_CustID  |FirstName  |LastName   |Email                  |MailAddress            |Phone
NULL        |   11      |Ahmad      |Raza       |ahmaddba@gmail.com     |154 Zafarwal           |0345 2876543
NULL        |   12      |Iftikhan   |Khan       |iffikhan@gmail.com     |12 A DHA Phase         |0303 56871298
NULL        |   13      |Iftikhan   |Khan       |iffikhan@gmail.com     |12 A DHA Phase         |0303 56871298
NULL        |   14      |Mohsin     |Khan       |mohsinkaz@gmail.com    |55 shadab nagar        |0301 6791255
NULL        |   15      |Mohsin     |Khan       |mohsinkaz@gmail.com    |55 shadab nagar        |0301 6791255
NULL        |   16      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
NULL        |   17      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
NULL        |   18      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
NULL        |   19      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
NULL        |   20      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266

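Phase 1
I want to identify duplicate records based on FirstName, LastName, and Email, and assign a new key to NewCustID (numbering starts at 1 for the initial load and continues from the maximum value + 1 on later loads). The NewCustID numeric key is unique per record, except for duplicates: all records in a set of duplicates share a single key.

After NewCustID is assigned, the Cust_init table should look like this: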

NewCustID   |DW_CustID  |FirstName  |LastName   |Email                  |MailAddress            |Phone
1           |   11      |Ahmad      |Raza       |ahmaddba@gmail.com     |154 Zafarwal           |0345 2876543
2           |   12      |Iftikhan   |Khan       |iffikhan@gmail.com     |12 A DHA Phase         |0303 56871298
2           |   13      |Iftikhan   |Khan       |iffikhan@gmail.com     |12 A DHA Phase         |0303 56871298
3           |   14      |Mohsin     |Khan       |mohsinkaz@gmail.com    |55 shadab nagar        |0301 6791255
3           |   15      |Mohsin     |Khan       |mohsinkaz@gmail.com    |55 shadab nagar        |0301 6791255
4           |   16      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
4           |   17      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
4           |   18      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
4           |   19      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266
4           |   20      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266

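Phase 2
After NewCustID has been assigned in the Cust_init table, I want to copy only the unique rows into the MergedCust table. For each set of duplicate records, keep only the one row with the smallest DW_CustID.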

NewCustID   |DW_CustID  |FirstName  |LastName   |Email                  |MailAddress            |Phone
1           |   11      |Ahmad      |Raza       |ahmaddba@gmail.com     |154 Zafarwal           |0345 2876543
2           |   12      |Iftikhan   |Khan       |iffikhan@gmail.com     |12 A DHA Phase         |0303 56871298
3           |   14      |Mohsin     |Khan       |mohsinkaz@gmail.com    |55 shadab nagar        |0301 6791255
4           |   16      |Hamid      |Alvi       |hamidalvi@gmail.com    |12 A DHA Phase 2       |0300 7071266

My attempt
I came up with the following SQL to rank duplicate rows with the same number, but I'm not sure how to update NewCustID correctly.

;WITH cte as (
    SELECT  NewCustID, DW_CustID, FirstName,LastName, Email, MailAddress, Phone,
            dense_rank() OVER (ORDER BY FirstName , LastName, Email ) as RN
    FROM dbo.cust_init 
)
select RN,FirstName , LastName, Email 
from cte 

The result set looks like this. As a first step, I want to assign RN to NewCustID to see whether it serves the purpose.

RN  |FirstName  |LastName   |Email
1   |Ahmad      |Raza       |ahmaddba@gmail.com
2   |Hamid      |Alvi       |hamidalvi@gmail.com
2   |Hamid      |Alvi       |hamidalvi@gmail.com
2   |Hamid      |Alvi       |hamidalvi@gmail.com
2   |Hamid      |Alvi       |hamidalvi@gmail.com
2   |Hamid      |Alvi       |hamidalvi@gmail.com
3   |Iftikhan   |Khan       |iffikhan@gmail.com
3   |Iftikhan   |Khan       |iffikhan@gmail.com
4   |Mohsin     |Khan       |mohsinkaz@gmail.com
4   |Mohsin     |Khan       |mohsinkaz@gmail.com

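4 Answers:

Answer 0 (score: 1)

This is a difficult and computationally expensive problem, because it involves traversing a graph along three different types of edges (email address, phone, and mailing address).

To solve this with a single query, you can use a recursive CTE. Unfortunately, SQL Server doesn't support arrays, so to avoid cycles you need to keep track of the IDs you have already visited, which takes a lot of string manipulation.

Here is the query: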
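
To illustrate why this is a graph problem rather than a simple grouping, here is a minimal sketch with three hypothetical rows (the `demo` name and its values are made up for illustration and are not from the question). Row 1 and row 2 share an email, and row 2 and row 3 share a phone, so all three should end up with the same key, yet ranking on any single column never puts them together:

```sql
-- Hypothetical data: ids 1 and 2 share an email, ids 2 and 3 share a phone.
WITH demo (id, email, phone) AS (
    SELECT id, email, phone
    FROM (VALUES
        (1, 'a@example.com', '111'),
        (2, 'a@example.com', '222'),
        (3, 'b@example.com', '222')
    ) AS v (id, email, phone)
)
SELECT id, email, phone,
       DENSE_RANK() OVER (ORDER BY email) AS grp_by_email,  -- groups {1, 2} and {3}
       DENSE_RANK() OVER (ORDER BY phone) AS grp_by_phone   -- groups {1} and {2, 3}
FROM demo
ORDER BY id;
```

Linking row 1 to row 3 requires following one edge (email) and then another (phone), which is what a recursive traversal does.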

with cte as (
      select dw_custId, dw_custId as other_ci,
             convert(varchar(max), concat(',', dw_custId, ',')) as cis,
             convert(varchar(max), ',' + email + ',') as emails,
             convert(varchar(max), ',' + phone + ',') as phones,
             convert(varchar(max), ',' + mailaddress + ',') as mailaddresses,
             1 as lev
      from cust_init
      union all
      select cte.dw_custId, ci.dw_custId,
             concat(cte.cis, ci.dw_custId, ','),
             (case when cte.emails not like concat('%,', ci.email, ',%') then concat(cte.emails, ci.email, ',') else cte.emails end),
             (case when cte.phones not like concat('%,', ci.phone, ',%') then concat(cte.phones, ci.phone, ',') else cte.phones end),
             (case when cte.mailaddresses not like concat('%,', ci.mailaddress, ',%') then concat(cte.mailaddresses, ci.mailaddress, ',') else cte.mailaddresses end),
             lev + 1
      from cte join
           cust_init ci
           on cte.emails like concat('%,', ci.email, ',%') or
              cte.phones like concat('%,', ci.phone, ',%') or
              cte.mailaddresses like concat('%,', ci.mailaddress, ',%')
      where cte.cis not like concat('%,', ci.dw_custId, ',%') and lev < 10
     )
select dw_custid, min(other_ci), dense_rank() over (order by min(other_ci)) as newCustId
from cte
group by dw_custid;

Here is a db<>fiddle.

Edit:

You can use it in an update:
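
The cycle check in the query above depends on the visited IDs being stored as a comma-delimited string with a comma on both ends. As a small sketch (the `@visited` variable is illustrative and stands in for the `cis` column), the surrounding commas are what keep ID 1 from matching inside ID 11:

```sql
-- @visited mimics the "cis" column: a list with a leading and a trailing comma.
DECLARE @visited varchar(max) = ',11,12,';

SELECT
    -- ',12,' occurs in the list, so ID 12 counts as already visited.
    CASE WHEN @visited LIKE CONCAT('%,', 12, ',%') THEN 'seen' ELSE 'new' END AS id_12,
    -- ',1,' does not occur in ',11,12,', so ID 1 is correctly reported as new.
    CASE WHEN @visited LIKE CONCAT('%,', 1, ',%') THEN 'seen' ELSE 'new' END AS id_1,
    -- Without the delimiters, '1' matches inside '11': a false positive.
    CASE WHEN @visited LIKE '%1%' THEN 'seen' ELSE 'new' END AS id_1_naive;
```

Note that `LIKE` treats `%`, `_`, and `[` as wildcards, so values containing those characters (an underscore in an email address, for instance) can match more loosely than an exact comparison would.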

with cte ( . . . )
update t2
    set newCustId = x.newCustId
    from (select dw_custid, min(other_ci) as min_other_ci, dense_rank() over (order by min(other_ci)) as newCustId
          from cte
          group by dw_custid
         ) x join
         table2 t2
         on t2.dw_custid = x.dw_custid;

Answer 1 (score: 0)

WITH customers AS (
  SELECT 
    Dense_rank() OVER(
      ORDER BY 
        c.firstname, 
        c.lastname, 
        c.email
    ) AS rn, 
    * 
  FROM 
    #cust_init AS c) 
  INSERT INTO #mergedcust 
  SELECT 
    c.rn AS newcustid, 
    MIN(c.dw_custid) AS DW_CustID, -- keep the smallest DW_CustID per group, as the question asks
    c.firstname, 
    c.lastname, 
    c.email, 
    c.mailaddress, 
    c.phone 
  FROM 
    customers AS c 
  GROUP BY 
    c.rn, 
    c.firstname, 
    c.lastname, 
    c.email, 
    c.mailaddress, 
    c.phone;
SELECT 
  * 
FROM 
  #mergedcust

Answer 2 (score: 0)

Try this:

SELECT 
ROW_NUMBER() OVER (ORDER BY A.min_cust_id) NewCustID,
B.DW_CustID,
B.FirstName,
B.LastName,
B.Email,
B.MailAddress,
B.Phone
FROM 
(
    SELECT email, MIN(dw_custID) min_cust_id
    FROM cust_init
    GROUP BY EMAIL
)A
INNER JOIN cust_init B ON A.min_cust_id = B.DW_CustID

Answer 3 (score: 0)

This gives the expected results you posted above. Just number the duplicates with ROW_NUMBER and take the first one.

WITH CTE AS
(SELECT NewCustID, DW_custID,FirstName,LastName,Email,MailAddress,Phone, 
ROW_NUMBER() OVER(Partition by NewCustID ORDER BY DW_custID) RN -- RN = 1 is the smallest DW_CustID in each group
from #Cust_init
)

INSERT INTO #MergedCust
select NewCustID,DW_custID,FirstName,LastName,Email,MailAddress,Phone 
from CTE where RN = 1

SELECT * from #MergedCust

Edit: Given the data above, I assumed you had already figured out how to assign NewCustID. Here is how I would do it:

UPDATE t1
SET NewCustID = t2.DR
FROM #Cust_init t1
INNER JOIN (SELECT DW_CustID,
                   DENSE_RANK() OVER (ORDER BY FirstName, LastName, Email) AS DR
            FROM #Cust_init) t2
    ON t1.DW_CustID = t2.DW_CustID
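
The question also asks for keys to start at 1 on the initial load and continue from the maximum value + 1 afterwards. One possible sketch of that, using an updatable CTE against the question's dbo.Cust_init table, is shown below. The `@offset` variable and the `grp` CTE name are illustrative, and reading the current maximum from Cust_init itself is an assumption, since the question does not say where the maximum should come from:

```sql
-- Assumption: the highest key assigned so far lives in dbo.Cust_init.
-- On the initial load every NewCustID is NULL, so the offset falls back to 0.
DECLARE @offset int = (SELECT ISNULL(MAX(NewCustID), 0) FROM dbo.Cust_init);

WITH grp AS (
    SELECT NewCustID,
           DENSE_RANK() OVER (ORDER BY FirstName, LastName, Email) AS rn
    FROM dbo.Cust_init
    WHERE NewCustID IS NULL   -- only rows that have no key yet
)
UPDATE grp
SET NewCustID = @offset + rn; -- 1, 2, 3, ... on the first run; max + 1 onward later
```

This sketch only ranks the rows that are still unkeyed, so a newly loaded row that duplicates a customer keyed in an earlier load would receive a fresh number instead of the existing one. Handling that case would need an extra step that first copies the existing key to matching rows.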