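I'm using SQL Server 2014. Currently I'm trying to merge millions of personal application records into a single person record.
The records contain the following columns:
ID, First_Name, Last_Name, DOB, Post_Code, Mobile, Email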
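A person can enter their details multiple times, but because of fat fingers or fraud they sometimes enter incorrect details.
In my example, Christopher has filled in his details 5 times. First_Name, Last_Name and DOB are always correct; Post_Code, Mobile and Email contain various variations.
What I want to do is take the min(id) associated with this group, in this case 84015283, and put it into a new table. This will be the primary key, and you will then see the other ids that are associated with it.
Example
NID      CID
------------------
84015283 84015283
84015283 84069198
84015283 84070263
84015283 84369603
84015283 85061159
Where it gets a little complex is that two different people can have the same First_Name, Last_Name and DOB, so at least one of the other fields (post_code, mobile or email) must also match another record within the group, as they do in my example.
The first_name, last_name and DoB match between ids 84015283, 84069198 and 84070263. 84015283 and 84069198 are identical, so they match without an issue; 84070263 matches on the post code; 84369603 matches a previous record on the mobile; and 85061159 matches a previous mobile/email but not the post_code.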
If putting the NID in the original dataset is easier, I can work with that rather than putting it all in a separate table.
After some googling and trying to get my head around this, I believe that using "Merge" might be a good way to achieve what I am after, but I am concerned that it will take a very long time because of the number of records involved.
Also, any routine would have to be run on subsequent new records.
I have listed the example code in case anyone can help:
DROP TABLE customer_dist
CREATE TABLE [dbo].customer_dist
(
[id] [int] NOT NULL,
[First_Name] [varchar](50) NULL,
[Last_Name] [varchar](50) NULL,
[DoB] [date] NULL,
[post_code] [varchar](50) NULL,
[mobile] [varchar](50) NULL,
[Email] [varchar](100) NULL,
)
INSERT INTO customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com'),
('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com'),
('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com'),
('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com')
SELECT * FROM customer_dist
Below are the expected results; sorry, I should have made it clearer what I wanted at the end.
Output table results
NID      id       First_Name  Last_Name DoB        post_code mobile     Email
84015283 84015283 Christopher Higg      1/13/1956  CH2 3AZ   7089559829 CH@hotmail.com
84015283 84069198 Christopher Higg      1/13/1956  CH2 3AZ   7089559829 CH@hotmail.com
84015283 84070263 Christopher Higg      1/13/1956  CH2 3AZ   7089559822 CHigg@AOL.com
84015283 84369603 Christopher Higg      1/13/1956  CH2 3ZA   7089559829 Higg@emailme.com
84015283 85061159 CHRISTOPHER Higg      1/13/1956  CH2 3RA   7089559829 CH@hotmail.com
78065122 87065122 Matthew     Davis     05/10/1978 CH5 1TS   7077084692 Matt@gamil.com
Apologies for the slow response.
I have updated my required output: I was asked to add an extra record that does not match the other records, but I had not included it in my required output.
HABO's response was the closest. Unfortunately, on further testing with other sample data, duplicates were created and the logic fell down.
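Answer 0 (score: 0)
This is not an answer, but a comment too long to fit in the comments section.
Since the "equality" condition is complex, I think I would do it in stages: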
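Create "buckets" of similar customers. A bucket identifies all customers with the same first_name, last_name and dob. Add an index on the new "key" column to speed up grouping. A bucket may contain one or more real customers.
-- Key on name and DoB only; id is unique per row and would put every row in its own bucket.
select
lower(first_name) + '|' +
lower(last_name) + '|' +
convert(varchar(10), dob, 23) as k,
id, post_code, mobile, email
into bucket
from customer_dist;
create index ix1 on bucket(k);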
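Work on each bucket and separate the customers within it. Most likely there is only one, but there can be several.
Here you need to run some iterative algorithm that compares rows, marks them as belonging to the same group or to different groups, and finally merges groups into single groups. All of this is possible, but I'm afraid I can't see how to do it in SQL alone.
You will need to do some coding here.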
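The iterative grouping described above amounts to finding connected components, so one way to do the coding step outside SQL is a union-find structure. The following is only a minimal sketch in Python, not a tested production routine. It assumes the rows have been read into tuples, that comparisons should be case-insensitive (as they would be under a case-insensitive SQL Server collation), and that the helper name `assign_nids` is purely illustrative.

```python
# Sample rows from the question:
# (id, first_name, last_name, dob, post_code, mobile, email)
ROWS = [
    (84015283, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559829", "CH@hotmail.com"),
    (84069198, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559829", "CH@hotmail.com"),
    (84070263, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559822", "CHigg@AOL.com"),
    (84369603, "Christopher", "Higg", "1956-01-13", "CH2 3ZA", "07089559829", "Higg@emailme.com"),
    (85061159, "CHRISTOPHER", "Higg", "1956-01-13", "CH2 3RA", "07089559829", "CH@hotmail.com"),
    (87065122, "Matthew", "Davis", "1978-05-10", "CH5 1TS", "07077084692", "Matt@gamil.com"),
]


def assign_nids(rows):
    """Map every id to the smallest id in its group (hypothetical helper)."""
    parent = {row[0]: row[0] for row in rows}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return
        # Keep the smaller id as the root, so the root is always the NID.
        if root_a < root_b:
            parent[root_b] = root_a
        else:
            parent[root_a] = root_b

    # Rows that share name + DoB and any one optional value get linked.
    first_seen = {}
    for rid, first, last, dob, post_code, mobile, email in rows:
        person = (first.lower(), last.lower(), dob)
        for field, value in (("post_code", post_code), ("mobile", mobile), ("email", email)):
            if value is None:
                continue  # the columns are nullable; NULL never matches
            key = person + (field, value.lower())
            if key in first_seen:
                union(first_seen[key], rid)
            else:
                first_seen[key] = rid

    return {rid: find(rid) for rid in parent}


if __name__ == "__main__":
    for rid, nid in sorted(assign_nids(ROWS).items()):
        print(nid, rid)
```

Because links are merged transitively, a row that matches the group only through an intermediate row still ends up with the group's lowest id. Each row is visited once, so this should scale much better than comparing every pair of rows, although I have not measured it on millions of records.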
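Answer 1 (score: 0)
The example below uses a CTE to pair up rows (by joining the table to itself) that have matching column values as per the requirements. In each pair the "left" row precedes the "right" row in Id order, which avoids duplicate results that differ only in having the Id values swapped.
The results of the CTE are then combined with one extra row per group of matched rows, to supply the odd extra row that matches itself, i.e. NId = Id.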
-- Sample data.
declare @customer_dist as Table (
[id] [int] NOT NULL,
[First_Name] [varchar](50) NULL,
[Last_Name] [varchar](50) NULL,
[DoB] [date] NULL,
[post_code] [varchar](50) NULL,
[mobile] [varchar](50) NULL,
[Email] [varchar](100) NULL );
INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com'),
('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com'),
('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com'),
('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com');
SELECT * FROM @customer_dist;
-- Process the data.
with PairedRows as (
-- Pairs of rows where the "left" row precedes the "right" in Id order and the rows match per the stated requirements.
select CDL.id as NId, CDR.id as Id
from @customer_dist as CDL inner join
@customer_dist as CDR on
-- Pair rows where the "left" row precedes the "right" in Id order.
CDR.Id > CDL.Id and
-- "Must match" columns.
CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and
-- Plus at least one optional match.
( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email )
-- Where there is not a prior row (in Id order) that matches the "left" row.
where not exists (
select 42 from @customer_dist as NE where NE.ID < CDL.Id and
NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) )
select NId, Id -- The paired rows.
from PairedRows
union all
-- Add the NId row as a match to itself for every group of paired rows.
select Min( NId ) as NID, Min( NId ) as Id
from PairedRows
group by NId
order by NID, Id;
Chasing the moving question.
The following adds anyone who is unpaired to the output via another union all:
-- Process the data.
with PairedRows as ( -- Pairs of rows where the "left" row precedes the "right" in Id order and the rows match per the stated requirements.
select CDL.id as NId, CDR.id as Id
from @customer_dist as CDL inner join
@customer_dist as CDR on CDR.Id > CDL.Id and -- Pair rows where the "left" row precedes the "right" in Id order.
CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and -- "Must match" columns.
( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email ) -- Plus at least one optional match.
where not exists ( -- Where there is not a ...
select 42 from @customer_dist as NE where NE.ID < CDL.Id and -- ... prior row (in Id order) that matches the "left" row.
NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) )
select NId, Id -- The paired rows.
from PairedRows
union all
select Min( NId ) as NID, Min( NId ) as Id -- Add the NId row as a match to itself for every group of paired rows.
from PairedRows
group by NId
union all
select id, id -- Toss in anyone we haven't heard of.
from @customer_dist as CD
where not exists ( select 42 from PairedRows as PR where PR.NId = CD.id or PR.Id = CD.id )
order by NID, Id;
One more mashup, to show the reason for each output row:
-- Sample data.
declare @customer_dist as Table (
[id] [int] NOT NULL,
[First_Name] [varchar](50) NULL,
[Last_Name] [varchar](50) NULL,
[DoB] [date] NULL,
[post_code] [varchar](50) NULL,
[mobile] [varchar](50) NULL,
[Email] [varchar](100) NULL );
INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07706212920', 'nastie220@yahoo.com'),
('35963960', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863324', 'nastie@hotmail.com'),
('38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', 'nastie2001@yahoo.com'),
('46653041', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07483888179', 'nastie2010@yahoo.com'),
('48023677', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07483888179', 'nastie@hotmail.com'),
('49560434', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', 'nastie@hotmail.com'),
('49861032', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07849727199', 'nastie2001@yahoo.com'),
('53130969', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', 'Nastie@hotmail.cm'),
('33843283', 'Mary', 'Wilson', '1983-09-20', 'BT148HU', '07484863478', 'nastie2010@yahoo.co.uk'),
-- NB: Unique Id in the following row.
('386279750', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', 'nastie2001@yahoo.com');
INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com'),
('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com'),
('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com'),
('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com');
SELECT * FROM @customer_dist;
select ( select Count(*) from @customer_dist ) as TotalRows, ( select Count( distinct id ) from @customer_dist ) as DistinctIds;
-- Process the data.
with PairedRows as ( -- Pairs of rows where the "left" row precedes the "right" in Id order and the rows match per the stated requirements.
select CDL.id as NId, CDR.id as Id
from @customer_dist as CDL inner join
@customer_dist as CDR on CDR.Id > CDL.Id and -- Pair rows where the "left" row precedes the "right" in Id order.
CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and -- "Must match" columns.
( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email ) -- Plus at least one optional match.
where not exists ( -- Where there is not a ...
select 42 from @customer_dist as NE where NE.ID < CDL.Id and -- ... prior row (in Id order) that matches the "left" row.
NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) ),
Results as (
select NId, Id, 'Paired' as Reason -- The paired rows.
from PairedRows
union all
select Min( NId ) as NID, Min( NId ) as Id, 'Self' -- Add the NId row as a match to itself for every group of paired rows.
from PairedRows
group by NId
union all
select id, id, 'Other' -- Toss in anyone we haven't heard of.
from @customer_dist as CD
where not exists ( select 42 from PairedRows as PR where PR.NId = CD.id or PR.Id = CD.id ) )
select R.NId, R.Id, R.Reason,
CDL.First_Name, CDL.Last_Name,
case when CDL.DoB = CDR.DoB then '=' else '' end as MatchDoB, -- Must match.
case when CDL.post_code = CDR.post_code then '=' else '' end as MatchPostCode,
case when CDL.mobile = CDR.mobile then '=' else '' end as MatchMobile,
case when CDL.Email = CDR.Email then '=' else '' end as MatchEmail,
case when CDL.id = CDR.id then '==' else '' end as MatchSelf,
case when ( select Count(*) from Results as IR where IR.NId = R.NId and IR.Id = R.Id ) > 1 then '#' else '' end as 'Duplicate'
from Results as R inner join
@customer_dist as CDL on CDL.id = R.NId inner join
@customer_dist as CDR on CDR.id = R.Id
order by NID, Id;
Answer 2 (score: 0)
Try this (the necessary comments are in the code):
;with cte as (
SELECT 1 n, 84015283 CID, * FROM @tbl
where id = 84015283
union all
select c.n + 1, 84015283, t.* from cte c
join @tbl t on
c.First_Name = t.first_name and
c.Last_Name = t.Last_name and
c.DoB = t.DoB and (
c.post_code = t.post_code or
c.mobile = t.mobile or
c.Email = t.Email
) and
--there is no way of writing stop condition here,
--as joining will return in some rows every time,
--so you have to enter here number big enough for
--query to join all records, here 1 suffices
--(if you enter bigger number, result will stay the same
--due to distinct in select)
c.n <= 1
)
select distinct CID,
id NID,
First_Name,
Last_Name,
DoB,
post_code,
mobile,
Email
from cte
Another approach is to use a while loop:
declare @tempTable table
(
[id] [int] NOT NULL,
[First_Name] [varchar](50) NULL,
[Last_Name] [varchar](50) NULL,
[DoB] [date] NULL,
[post_code] [varchar](50) NULL,
[mobile] [varchar](50) NULL,
[Email] [varchar](100) NULL
);
insert into @tempTable
select *
from @customer_dist
declare @inserted int = -1;
while @inserted <> (select count(*) from @tempTable)
begin
select @inserted = count(*) from @tempTable
insert into @tempTable
select c.* from @customer_dist c
where exists(select 1 from @tempTable t
where c.First_Name = t.first_name and
c.Last_Name = t.Last_name and
c.DoB = t.DoB and (
c.post_code = t.post_code or
c.mobile = t.mobile or
c.Email = t.Email
)
)
except
select * from @tempTable
end
select MAX(NID) over (partition by first_name,last_name) NID,
id, First_Name, Last_Name, DoB, post_code, mobile, Email
from (
select (case when ROW_NUMBER() over (partition by first_name,last_name order by (select null)) = 1 then 1 else 0 end) * id NID,
*
from @tempTable
) a
select * from @tempTable
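It loops as long as new records are being added to @tempTable. With your sample data it loops only once.
The difference from the previous query is that, thanks to except, only new records are added at each step of the loop, which is not possible in the CTE.
It also performs better, because it uses exists to determine which rows still have to be added. That is not allowed in the CTE, because a CTE cannot appear in a subquery.
Most importantly, it guarantees that you will not miss any records! In the CTE you have to limit the recursion with c.n <= 1, which may lose records.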
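The loop described above is a fixpoint iteration: keep adding matching rows until a pass adds nothing new. Here is a minimal Python sketch of that idea, not a translation of the T-SQL. The rows are hypothetical, the helper names `matches` and `expand_group` are made up for illustration, and the case-insensitive comparison is an assumption.

```python
# Hypothetical rows: (id, first_name, last_name, dob, post_code, mobile, email)
ROWS = [
    (1, "Ann", "Lee", "1990-01-01", "AA1 1AA", "07000000001", "ann@example.com"),
    (2, "Ann", "Lee", "1990-01-01", "BB2 2BB", "07000000001", "a.lee@example.com"),
    (3, "Ann", "Lee", "1990-01-01", "CC3 3CC", "07000000003", "a.lee@example.com"),
    (4, "Ann", "Lee", "1990-01-01", "DD4 4DD", "07000000004", "other@example.com"),
]


def matches(a, b):
    """Same name and DoB, plus at least one of post_code, mobile, email."""
    same_person = (a[1].lower(), a[2].lower(), a[3]) == (b[1].lower(), b[2].lower(), b[3])
    shares_optional = any(x.lower() == y.lower() for x, y in zip(a[4:], b[4:]))
    return same_person and shares_optional


def expand_group(rows, seed_id):
    """Grow the group around seed_id until a pass adds no new ids."""
    by_id = {row[0]: row for row in rows}
    group = {seed_id}
    while True:
        new_ids = {
            row[0]
            for row in rows
            if row[0] not in group
            and any(matches(row, by_id[member]) for member in group)
        }
        if not new_ids:
            return group  # fixpoint reached: nothing left to add
        group |= new_ids


if __name__ == "__main__":
    print(sorted(expand_group(ROWS, 1)))
```

In this data, row 2 joins through the mobile number and row 3 joins only through row 2's email, so it takes more than one pass before the group stops growing. Row 4 shares nothing and stays out.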
Answer 3 (score: 0)
[dbo].[LEVENSHTEIN]
CREATE FUNCTION [dbo].[LEVENSHTEIN](@left VARCHAR(100),
@right VARCHAR(100))
RETURNS INT
AS
BEGIN
DECLARE @difference INT,
@lenRight INT,
@lenLeft INT,
@leftIndex INT,
@rightIndex INT,
@left_char CHAR(1),
@right_char CHAR(1),
@compareLength INT
SET @lenLeft = LEN(@left)
SET @lenRight = LEN(@right)
SET @difference = 0
IF @lenLeft = 0
BEGIN
SET @difference = @lenRight
GOTO done
END
IF @lenRight = 0
BEGIN
SET @difference = @lenLeft
GOTO done
END
GOTO comparison
COMPARISON:
IF ( @lenLeft >= @lenRight )
SET @compareLength = @lenLeft
ELSE
SET @compareLength = @lenRight
SET @rightIndex = 1
SET @leftIndex = 1
WHILE @leftIndex <= @compareLength
BEGIN
SET @left_char = SUBSTRING(@left, @leftIndex, 1)
SET @right_char = SUBSTRING(@right, @rightIndex, 1)
IF @left_char <> @right_char
BEGIN -- Would an insertion make them re-align?
IF( @left_char = SUBSTRING(@right, @rightIndex + 1, 1) )
SET @rightIndex = @rightIndex + 1
-- Would an deletion make them re-align?
ELSE IF( SUBSTRING(@left, @leftIndex + 1, 1) = @right_char )
SET @leftIndex = @leftIndex + 1
SET @difference = @difference + 1
END
SET @leftIndex = @leftIndex + 1
SET @rightIndex = @rightIndex + 1
END
GOTO done
DONE:
RETURN @difference
END
GO
[dbo].[GetPercentageOfTwoStringMatching]
CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
@string1 NVARCHAR(100)
,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN
DECLARE @levenShteinNumber INT
DECLARE @string1Length INT = LEN(@string1)
, @string2Length INT = LEN(@string2)
DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END
SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] ( @string1 ,@string2)
DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber
DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters
-- Return the result of the function
RETURN @percentageOfGoodCharacters
END
GO
Query
DECLARE @customer_dist TABLE
(
[id] [INT] NOT NULL ,
[First_Name] [VARCHAR](50) NULL ,
[Last_Name] [VARCHAR](50) NULL ,
[DoB] [DATE] NULL ,
[post_code] [VARCHAR](50) NULL ,
[mobile] [VARCHAR](50) NULL ,
[Email] [VARCHAR](100) NULL
);
INSERT INTO @customer_dist ( id ,
First_Name ,
Last_Name ,
DoB ,
post_code ,
mobile ,
Email )
VALUES ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
'07089559829' , 'CH@hotmail.com' ) ,
( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
'07089559829' , 'CH@hotmail.com' ) ,
( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
'07089559822' , 'CHigg@AOL.com' ) ,
( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA' ,
'07089559829' , 'Higg@emailme.com' ) ,
( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA' ,
'07089559829' , 'CH@hotmail.com' ) ,
( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS' ,
'07077084692' , 'Matt@gamil.com' ) ,
( '94015281', 'Christopher', 'Higg', '1956-01-13', 'NN2 1XH' ,
'08009777337' , 'CHigg@gmail.com' );
SELECT result.* ,
[dbo].GetPercentageOfTwoStringMatching(result.DoB, d.DoB) [DOB%match] ,
[dbo].GetPercentageOfTwoStringMatching(result.post_code, d.post_code) [post_code%match] ,
[dbo].GetPercentageOfTwoStringMatching(result.mobile, d.mobile) [mobile%match] ,
[dbo].GetPercentageOfTwoStringMatching(result.Email, d.Email) [email%match]
FROM ( SELECT ( SELECT MIN(id)
FROM @customer_dist AS sq
WHERE sq.First_Name = cd.First_Name
AND sq.Last_Name = cd.Last_Name
AND ( sq.mobile = cd.mobile
OR sq.Email = cd.Email
OR sq.post_code = cd.post_code )) nid ,
*
FROM @customer_dist AS cd ) AS result
INNER JOIN @customer_dist d ON result.nid = d.id;
Second query
DECLARE @customer_dist TABLE
(
[id] [INT] NOT NULL ,
[First_Name] [VARCHAR](50) NULL ,
[Last_Name] [VARCHAR](50) NULL ,
[DoB] [DATE] NULL ,
[post_code] [VARCHAR](50) NULL ,
[mobile] [VARCHAR](50) NULL ,
[Email] [VARCHAR](100) NULL
);
INSERT INTO @customer_dist ( id ,
First_Name ,
Last_Name ,
DoB ,
post_code ,
mobile ,
Email )
VALUES ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
'07089559829' , 'CH@hotmail.com' ) ,
( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
'07089559829' , 'CH@hotmail.com' ) ,
( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
'07089559822' , 'CHigg@AOL.com' ) ,
( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA' ,
'07089559829' , 'Higg@emailme.com' ) ,
( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA' ,
'07089559829' , 'CH@hotmail.com' ) ,
( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS' ,
'07077084692' , 'Matt@gamil.com' ) ,
( '94015281', 'Christopher', 'Higg', '1956-01-13', 'NN2 1XH' ,
'08009777337' , 'CHigg@gmail.com' );
SELECT result.* ,
[dbo].GetPercentageOfTwoStringMatching(result.DoB, d.DoB) [DOB%match] ,
[dbo].GetPercentageOfTwoStringMatching(result.post_code, d.post_code) [post_code%match] ,
[dbo].GetPercentageOfTwoStringMatching(result.mobile, d.mobile) [mobile%match] ,
[dbo].GetPercentageOfTwoStringMatching(result.Email, d.Email) [email%match]
FROM ( SELECT ( SELECT MIN(id)
FROM @customer_dist AS sq
WHERE sq.First_Name = cd.First_Name
AND sq.Last_Name = cd.Last_Name
AND ( sq.DoB = cd.DoB
OR sq.mobile = cd.mobile
OR sq.Email = cd.Email
OR sq.post_code = cd.post_code )) nid ,
*
FROM @customer_dist AS cd ) AS result
INNER JOIN @customer_dist d ON result.nid = d.id;
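Answer 4 (score: 0)
Since you have mentioned that the "group" is based mainly on three columns, FirstName, LastName and DOB, you could create a view that tracks the minimum ID across all records and use that view for further processing when needed.
You could also create a CTE. It all depends on how you intend to use the result set.
I would not try to update existing records in the customer_dist table, because it serves as the raw table in case you ever want to go back and see the exact data users entered at different points in time (if you are interested in statistics or data trends).
Either way, the query is: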
SELECT
MIN(id) AS Min_Id,
LOWER(First_Name) AS firstName, LOWER(Last_Name) As lastName, DoB
FROM
customer_dist
GROUP BY
LOWER(First_Name), LOWER(Last_Name), DoB;
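Answer 5 (score: 0)
If you use UNION, it will be a heavy operation, but it does remove duplicate rows.
Also, I highly recommend using SSIS with its "fuzzy logic". It is a more effective way of identifying near-duplicates. This is just an example I found on YouTube that may point you in the right direction. I hope this helps.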
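Answer 6 (score: 0)
I used to work at a very old-school insurance company that had a similar problem with its data.
My main effort here is to narrow the result set down to the rows that contain duplicates, and so find what ties the duplicates together. Once you have that, the rest of the solution is quick.
The logic is: join the table to itself on the columns that always share the same values (Fname, Lname, DOB) and on those that occasionally share the same values (post_code, mobile, email). More importantly, the ids must not match, which ensures that non-dup records are excluded and only the dups are kept.
Once you have only the dups, find the MIN(id), put it in a CTE, join back to the original table, and you are done. A non-duplicate record does not need a min-id, because its own id is the min-id.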
;WITH DUPS AS
(
SELECT DISTINCT
MIN(C1.ID) OVER(PARTITION BY C1.First_Name,
C1.Last_Name, C1.DoB) AS minid,
C1.id, C1.First_Name, C1.Last_Name, C1.DoB
FROM customer_dist c1
INNER join customer_dist c2
ON
c1.First_Name = c2.First_Name
AND c1.Last_Name = c2.Last_Name
AND c1.DoB = c2.DoB
AND (c1.post_code = c2.post_code OR c1.mobile = c2.mobile
OR
c1.Email = c2.Email)
AND C1.ID <> C2.ID
)
SELECT ISNULL(D.minid, C.ID) AS NID,
C.*
FROM customer_dist C
LEFT JOIN DUPS D ON C.id = D.id
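To check how the query above behaves, it can help to mirror its logic on a handful of rows. This is a rough Python sketch of my reading of it, not the query itself: the helper names are made up, case-insensitive comparison is assumed, and the rows are taken from the sample data in this thread. The minimum is taken per First_Name, Last_Name and DoB over every row that has at least one duplicate partner.

```python
# (id, first_name, last_name, dob, post_code, mobile, email)
ROWS = [
    (84015283, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559829", "CH@hotmail.com"),
    (84369603, "Christopher", "Higg", "1956-01-13", "CH2 3ZA", "07089559829", "Higg@emailme.com"),
    (84369605, "Christopher", "Higg", "1956-01-13", "CH2 ZZZ", "07089559999", "chrish@gmail.com"),
    (84369677, "Christopher", "Higg", "1956-01-13", "AH2 ZZZ", "09089559999", "chrish@gmail.com"),
    (87065122, "Matthew", "Davis", "1978-05-10", "CH5 1TS", "07077084692", "Matt@gamil.com"),
]


def person_key(row):
    """The columns the window function partitions by."""
    return (row[1].lower(), row[2].lower(), row[3])


def is_dup_pair(a, b):
    """Mirror of the self-join condition: same person key, one optional match, different id."""
    return (
        a[0] != b[0]
        and person_key(a) == person_key(b)
        and any(x.lower() == y.lower() for x, y in zip(a[4:], b[4:]))
    )


def nid_by_partition_min(rows):
    """NID = min id per person key among rows that have a dup partner, else own id."""
    dups = [a for a in rows if any(is_dup_pair(a, b) for b in rows)]
    min_id = {}
    for row in dups:
        key = person_key(row)
        min_id[key] = min(min_id.get(key, row[0]), row[0])
    dup_ids = {row[0] for row in dups}
    return {
        row[0]: (min_id[person_key(row)] if row[0] in dup_ids else row[0])
        for row in rows
    }


if __name__ == "__main__":
    for rid, nid in sorted(nid_by_partition_min(ROWS).items()):
        print(nid, rid)
```

In this sketch 84369605 and 84369677 match each other only on email, and neither shares a post code, mobile or email with 84015283, yet both receive 84015283 because the minimum is taken per name and DoB. Whether that is acceptable depends on how strictly the "two different people with the same name and DoB" rule from the question has to hold.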
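Answer 7 (score: 0)
Perhaps the most elegant solution is to use OVER (PARTITION BY ...) to match them up. Normally, if all of your criteria could be ANDed together, this would be simple. Because you need some OR logic on the post_code, mobile and email columns, you need to add a few extra steps.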
SELECT
*,
NID_post_code = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
NID_mobile = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
NID_email = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
FROM @customer_dist
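Now you have a result set that shows, for each id, the lowest matching id under each of three different sets of criteria.
We know that the lowest of the matching ids across those three criteria is the one we want...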
SELECT
NID = (
SELECT
MIN(NID)
FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
),
cd.*
FROM (
SELECT
*,
NID_post_code = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
NID_mobile = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
NID_email = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
FROM @customer_dist
) AS cd
order BY (
SELECT
MIN(NID)
FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
)
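You can use these results to create a lookup/cross-reference table, or you can add an NID column to the original table and merge these results into it.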
DECLARE @customer_dist AS table (
id int NOT NULL,
First_Name varchar(50) NULL,
Last_Name varchar(50) NULL,
DoB date NULL,
post_code varchar(50) NULL,
mobile varchar(50) NULL,
Email varchar(100) NULL
);
INSERT INTO @customer_dist ( id, First_Name , Last_Name, DoB, post_code, mobile, Email )
VALUES
( '32006455', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07706212920', 'nastie220@yahoo.com' ),
( '35963960', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863324', 'nastie@hotmail.com' ),
( '38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', 'nastie2001@yahoo.com' ),
( '46653041', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07483888179', 'nastie2010@yahoo.com' ),
( '48023677', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07483888179', 'nastie@hotmail.com' ),
( '49560434', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', 'nastie@hotmail.com' ),
( '49861032', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07849727199', 'nastie2001@yahoo.com' ),
( '53130969', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', 'Nastie@hotmail.cm' ),
( '33843283', 'Mary', 'Wilson', '1983-09-20', 'BT148HU', '07484863478', 'nastie2010@yahoo.co.uk' ),
( '38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', 'nastie2001@yahoo.com' ),
( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com' ),
( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com' ),
( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com' ),
( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com' ),
( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com' ),
( '84369605', 'Christopher', 'Higg', '1956-01-13', 'CH2 ZZZ', '07089559999', 'chrish@gmail.com' ),
( '84369677', 'Christopher', 'Higg', '1956-01-13', 'AH2 ZZZ', '09089559999', 'chrish@gmail.com' ),
( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com' ),
( '87065123', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com' )
SELECT
NID = (
SELECT
MIN(NID)
FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
),
cd.*
FROM (
SELECT
*,
NID_post_code = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
NID_mobile = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
NID_email = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
FROM @customer_dist
) AS cd
order BY (
SELECT
MIN(NID)
FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
)
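For readers who want to trace the arithmetic, the same computation can be sketched outside SQL. This is a minimal Python illustration with assumed helper names (`partition_key`, `group_min`, `single_pass_nids`) and assumed case-insensitive comparison: for each of the three partitions it finds the lowest id, and each row then takes the lowest of its three candidates.

```python
# Sample rows from the question:
# (id, first_name, last_name, dob, post_code, mobile, email)
ROWS = [
    (84015283, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559829", "CH@hotmail.com"),
    (84069198, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559829", "CH@hotmail.com"),
    (84070263, "Christopher", "Higg", "1956-01-13", "CH2 3AZ", "07089559822", "CHigg@AOL.com"),
    (84369603, "Christopher", "Higg", "1956-01-13", "CH2 3ZA", "07089559829", "Higg@emailme.com"),
    (85061159, "CHRISTOPHER", "Higg", "1956-01-13", "CH2 3RA", "07089559829", "CH@hotmail.com"),
    (87065122, "Matthew", "Davis", "1978-05-10", "CH5 1TS", "07077084692", "Matt@gamil.com"),
]

OPTIONAL_FIELDS = (4, 5, 6)  # tuple positions of post_code, mobile, email


def partition_key(row, field_index):
    """Name + DoB + one optional column, compared case-insensitively."""
    return (row[1].lower(), row[2].lower(), row[3], row[field_index].lower())


def group_min(rows, field_index):
    """Lowest id per partition, like MIN(id) OVER (PARTITION BY ...)."""
    mins = {}
    for row in rows:
        key = partition_key(row, field_index)
        mins[key] = min(mins.get(key, row[0]), row[0])
    return mins


def single_pass_nids(rows):
    """For each row, take the lowest of its three partition minimums."""
    per_field = {i: group_min(rows, i) for i in OPTIONAL_FIELDS}
    return {
        row[0]: min(per_field[i][partition_key(row, i)] for i in OPTIONAL_FIELDS)
        for row in rows
    }


if __name__ == "__main__":
    for rid, nid in sorted(single_pass_nids(ROWS).items()):
        print(nid, rid)
```

On the question's six rows every Christopher record resolves to 84015283 and Matthew keeps his own id. Note that this is a single pass: a row that is linked to the lowest id only through an intermediate row receives the intermediate partition's minimum instead, so on chained data the pass may need to be repeated until nothing changes.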
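Answer 8 (score: 0)
This ultimately looks to me like a data-ranking problem. With that in mind, we can use the DENSE_RANK window function to determine how to group our accounts together. The following example shows how that might be done.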
--header