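My customer data contains duplicate records: different records share key fields such as email address, phone number, or mailing address. I want to identify sets of duplicates based on a matching email, phone, or mailing address and assign each set a common duplicate number (the same ID) that marks its records as the same customer. Once that is done, I want to move the unique customer records to another table, which will contain no duplicates and serve as the master table.
I was able to use dense_rank to give duplicate records the same number, but I got stuck after that and don't know how to assign the new key to NewCustID for all records.
Initial tables and sample data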
create table Cust_init(
NewCustID int,
DW_CustID int,
FirstName varchar(50),
LastName varchar(50),
Email varchar(50),
MailAddress varchar(50),
Phone varchar(50)
)
create table MergedCust(
NewCustID int,
DW_CustID int,
FirstName varchar(50),
LastName varchar(50),
Email varchar(50),
MailAddress varchar(50),
Phone varchar(50)
)
insert into dbo.cust_init(DW_CustID,FirstName, LastName,Email,MailAddress,Phone)
values(11,'Ahmad','Raza','ahmaddba@gmail.com','154 Zafarwal, Narowaal','0345 2876543'),
(12,'Iftikhan','Khan','iffikhan@gmail.com','12 A DHA Phase ','0303 56871298'),
(13,'Iftikhan','Khan','iffikhan@gmail.com','12 A DHA Phase ','0303 56871298'),
(14,'Mohsin','Khan','mohsinkaz@gmail.com','55 shadab nagar, Lahore','0301 6791255'),
(15,'Mohsin','Khan','mohsinkaz@gmail.com','55 shadab nagar, Lahore','0301 6791255'),
(16,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
(17,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
(18,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
(19,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266'),
(20,'Hamid','Alvi','hamidalvi@gmail.com','12 A DHA Phase 2','0300 7071266');
After inserting the data, the Cust_init table should look like this:
NewCustID |DW_CustID |FirstName |LastName |Email |MailAddress |Phone
NULL | 11 |Ahmad |Raza |ahmaddba@gmail.com |154 Zafarwal |0345 2876543
NULL | 12 |Iftikhan |Khan |iffikhan@gmail.com |12 A DHA Phase |0303 56871298
NULL | 13 |Iftikhan |Khan |iffikhan@gmail.com |12 A DHA Phase |0303 56871298
NULL | 14 |Mohsin |Khan |mohsinkaz@gmail.com |55 shadab nagar |0301 6791255
NULL | 15 |Mohsin |Khan |mohsinkaz@gmail.com |55 shadab nagar |0301 6791255
NULL | 16 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
NULL | 17 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
NULL | 18 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
NULL | 19 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
NULL | 20 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
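Phase 1
I want to identify duplicate records based on FirstName, LastName, and Email and assign a new key to NewCustID. The numeric key starts at 1 for the initial load and continues from the maximum value + 1 on later loads. Each record gets its own NewCustID, except that all records in a set of duplicates share a single key.
After NewCustID is assigned, the Cust_init table should look like this:
NewCustID |DW_CustID |FirstName |LastName |Email |MailAddress |Phone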
1 | 11 |Ahmad |Raza |ahmaddba@gmail.com |154 Zafarwal |0345 2876543
2 | 12 |Iftikhan |Khan |iffikhan@gmail.com |12 A DHA Phase |0303 56871298
2 | 13 |Iftikhan |Khan |iffikhan@gmail.com |12 A DHA Phase |0303 56871298
3 | 14 |Mohsin |Khan |mohsinkaz@gmail.com |55 shadab nagar |0301 6791255
3 | 15 |Mohsin |Khan |mohsinkaz@gmail.com |55 shadab nagar |0301 6791255
4 | 16 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
4 | 17 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
4 | 18 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
4 | 19 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
4 | 20 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
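Phase 2
After NewCustID is assigned in the Cust_init table, I want to copy only the unique rows into the MergedCust table. For each set of duplicates, keep only the row with the smallest DW_CustID.
NewCustID |DW_CustID |FirstName |LastName |Email |MailAddress |Phone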
1 | 11 |Ahmad |Raza |ahmaddba@gmail.com |154 Zafarwal |0345 2876543
2 | 12 |Iftikhan |Khan |iffikhan@gmail.com |12 A DHA Phase |0303 56871298
3 | 14 |Mohsin |Khan |mohsinkaz@gmail.com |55 shadab nagar |0301 6791255
4 | 16 |Hamid |Alvi |hamidalvi@gmail.com |12 A DHA Phase 2 |0300 7071266
My attempt
I came up with the following SQL, which gives duplicate rows the same rank, but I'm not sure how to update NewCustID correctly.
;WITH cte as (
SELECT NewCustID, DW_CustID, FirstName,LastName, Email, MailAddress, Phone,
dense_rank() OVER (ORDER BY FirstName , LastName, Email ) as RN
FROM dbo.cust_init
)
select RN,FirstName , LastName, Email
from cte
The result set is shown below. As a first step, I want to assign RN to NewCustID to see whether it serves the purpose.
RN |FirstName |LastName |Email
1 |Ahmad |Raza |ahmaddba@gmail.com
2 |Hamid |Alvi |hamidalvi@gmail.com
2 |Hamid |Alvi |hamidalvi@gmail.com
2 |Hamid |Alvi |hamidalvi@gmail.com
2 |Hamid |Alvi |hamidalvi@gmail.com
2 |Hamid |Alvi |hamidalvi@gmail.com
3 |Iftikhan |Khan |iffikhan@gmail.com
3 |Iftikhan |Khan |iffikhan@gmail.com
4 |Mohsin |Khan |mohsinkaz@gmail.com
4 |Mohsin |Khan |mohsinkaz@gmail.com
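Answer 0 (score: 1)
This is a hard and computationally expensive problem, because it involves traversing a graph along three different types of edges (email address, phone, and mailing address).
To solve it with a single query, you can use a recursive CTE. Unfortunately, SQL Server doesn't support arrays, so avoiding cycles means keeping track of the IDs already visited, which takes a lot of string manipulation.
Here is the query: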
with cte as (
select dw_custId, dw_custId as other_ci,
convert(varchar(max), concat(',', dw_custId, ',')) as cis,
convert(varchar(max), ',' + email + ',') as emails,
convert(varchar(max), ',' + phone + ',') as phones,
convert(varchar(max), ',' + mailaddress + ',') as mailaddresses,
1 as lev
from cust_init
union all
select cte.dw_custId, ci.dw_custId,
concat(cte.cis, ci.dw_custId, ','),
(case when cte.emails not like concat('%,', ci.email, ',%') then concat(cte.emails, ci.email, ',') else cte.emails end),
(case when cte.phones not like concat('%,', ci.phone, ',%') then concat(cte.phones, ci.phone, ',') else cte.phones end),
(case when cte.mailaddresses not like concat('%,', ci.mailaddress, ',%') then concat(cte.mailaddresses, ci.mailaddress, ',') else cte.mailaddresses end),
lev + 1
from cte join
cust_init ci
on cte.emails like concat('%,', ci.email, ',%') or
cte.phones like concat('%,', ci.phone, ',%') or
cte.mailaddresses like concat('%,', ci.mailaddress, ',%')
where cte.cis not like concat('%,', ci.dw_custId, ',%') and lev < 10
)
select dw_custid, min(other_ci), dense_rank() over (order by min(other_ci)) as newCustId
from cte
group by dw_custid;
Here is a db<>fiddle.
EDIT:
You can use it in an update:
with cte ( . . . )
update t2
set newCustId = x.newCustId
from (select dw_custid, min(other_ci) as min_other_ci, dense_rank() over (order by min(other_ci)) as newCustId
from cte
group by dw_custid
) x join
table2 t2
on t2.dw_custid = x.dw_custid;
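As an alternative to the recursive CTE, the same graph traversal can be sketched as iterative label propagation: every customer starts in its own group, and each pass lowers a customer's group label to the smallest label among the rows it shares an email, phone, or mail address with, until nothing changes. This is a different technique from the query above, not a rewrite of it. The temp table `#grp` and the variable `@changed` are hypothetical names introduced for illustration, and the sketch assumes DW_CustID is unique in cust_init.

```sql
-- Hypothetical working table: one group label per customer, seeded with its own ID.
select DW_CustID, DW_CustID as grp
into #grp
from dbo.cust_init;

declare @changed int = 1;

-- Repeat until a full pass updates no rows.
while @changed > 0
begin
    update g
    set grp = m.min_grp
    from #grp g
    join (
        -- Smallest current label among all rows linked to a by any of the three fields.
        select a.DW_CustID, min(gb.grp) as min_grp
        from dbo.cust_init a
        join dbo.cust_init b
          on a.Email = b.Email
          or a.Phone = b.Phone
          or a.MailAddress = b.MailAddress
        join #grp gb
          on gb.DW_CustID = b.DW_CustID
        group by a.DW_CustID
    ) m
      on m.DW_CustID = g.DW_CustID
    where m.min_grp < g.grp;

    set @changed = @@rowcount;
end;

-- Each connected set now carries its smallest DW_CustID as label;
-- dense_rank turns those labels into consecutive keys.
select DW_CustID, dense_rank() over (order by grp) as NewCustID
from #grp;
```

Labels only ever decrease, so the loop terminates, and no visited-ID strings are needed to avoid cycles. The trade-off is one self-join of cust_init per pass, and the number of passes grows with the longest chain of indirect matches (for example, A shares an email with B, and B shares a phone with C).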
Answer 1 (score: 0)
WITH customers AS (
SELECT
Dense_rank() OVER(
ORDER BY
c.firstname,
c.lastname,
c.email
) AS rn,
*
FROM
#cust_init AS c)
INSERT INTO #mergedcust
SELECT
c.rn AS newcustid,
MIN(c.dw_custid) AS DW_CustID, -- keep the smallest DW_CustID of each duplicate set
c.firstname,
c.lastname,
c.email,
c.mailaddress,
c.phone
FROM
customers AS c
GROUP BY
c.rn,
c.firstname,
c.lastname,
c.email,
c.mailaddress,
c.phone;
SELECT
*
FROM
#mergedcust
Answer 2 (score: 0)
Try this:
SELECT
ROW_NUMBER() OVER (ORDER BY A.min_cust_id) NewCustID,
B.DW_CustID,
B.FirstName,
B.LastName,
B.Email,
B.MailAddress,
B.Phone
FROM
(
SELECT email, MIN(dw_custID) min_cust_id
FROM cust_init
GROUP BY EMAIL
)A
INNER JOIN cust_init B ON A.min_cust_id = B.DW_CustID
Answer 3 (score: 0)
This gives the expected result you posted above. Just use ROW_NUMBER to number the duplicates and take the first one.
WITH CTE AS
(SELECT NewCustID, DW_custID,FirstName,LastName,Email,MailAddress,Phone,
ROW_NUMBER() OVER(Partition by NewCustID ORDER BY DW_custID) RN
from #Cust_init
)
INSERT INTO #MergedCust
select NewCustID,DW_custID,FirstName,LastName,Email,MailAddress,Phone
from CTE where RN = 1
SELECT * from #MergedCust
EDIT: Given the data above, I assumed you had already figured out how to assign NewCustID. Here is how I would do it:
UPDATE #Cust_init set NewCustID = DR
FROM #Cust_init t1
INNER JOIN (SELECT dw_custid, DENSE_RANK () OVER(order by firstname,lastname,email) DR from #Cust_init) t2
on t1.DW_CustID = t2.DW_CustID
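One caveat on the numbering: DENSE_RANK ordered by firstname, lastname, email numbers the groups alphabetically (Ahmad, Hamid, Iftikhan, Mohsin), while the expected result in the question numbers them in order of their smallest DW_CustID. A minimal sketch of both phases against the permanent tables from the question, assuming DW_CustID is unique; the aliases `k`, `r`, `first_id`, and `first_rows` are names introduced here for illustration:

```sql
-- Phase 1: rank each (FirstName, LastName, Email) group by its smallest DW_CustID.
update t
set NewCustID = r.rn
from dbo.cust_init as t
join (
    select k.DW_CustID,
           dense_rank() over (order by k.first_id) as rn
    from (
        select DW_CustID,
               min(DW_CustID) over (partition by FirstName, LastName, Email) as first_id
        from dbo.cust_init
    ) as k
) as r
  on r.DW_CustID = t.DW_CustID;

-- Phase 2: copy one row per NewCustID, the one with the smallest DW_CustID.
with first_rows as (
    select NewCustID, DW_CustID, FirstName, LastName, Email, MailAddress, Phone,
           row_number() over (partition by NewCustID order by DW_CustID) as rn
    from dbo.cust_init
)
insert into dbo.MergedCust (NewCustID, DW_CustID, FirstName, LastName, Email, MailAddress, Phone)
select NewCustID, DW_CustID, FirstName, LastName, Email, MailAddress, Phone
from first_rows
where rn = 1;
```

The window functions are split across two nested queries because T-SQL does not allow one window function inside another. This sketch covers only the initial load; continuing from the maximum value + 1 on later loads is not handled here.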