Merging duplicate records together with the "Merge" syntax

Time: 2018-08-22 14:20:47

Tags: sql sql-server tsql sql-server-2014

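I am using SQL Server 2014. Currently I am trying to consolidate millions of personal application records into a single record per person.

The records contain the following columns: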

ID, First_Name, Last_Name, DOB, Post_Code, Mobile, Email

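A person can enter their details multiple times, but because of fat fingers or fraud they sometimes enter incorrect details.

In my example, Christopher has filled in his details 5 times. First_Name, Last_Name and DOB are always correct, while Post_Code, Mobile and Email contain various permutations.

What I want to do in this case is take the min(id) associated with this group, 84015283, and put it into a new table where it will be the primary key; you would then see the other ids associated with it.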

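Example

NID         CID
------------------
84015283    84015283
84015283    84069198
84015283    84070263
84015283    84369603
84015283    85061159

Where this gets a little complicated is that two different people can have the same First_Name, Last_Name and DOB, so at least one of the other fields ("Post_Code, Mobile or Email") must also match another record within the group, as in my example.

Among these IDs, 84015283 and 84069198 are identical, so they match without any problem; 84070263 matches on post code; 84369603 matches a previous record on mobile; and 85061159 matches a previous record on mobile/email but not on post code.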

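If it is easier to put the NID in the original dataset, I can do that instead of keeping it all in a separate table.

After some googling and trying to get my head around this, I believe that using "Merge" might be a good way to achieve what I am after, but I am concerned that it will take a very long time because of the number of records involved.

Also, any routine would need to be run on subsequent new records.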
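For the "subsequent new records" part, here is a minimal sketch of how MERGE could maintain a lookup table. The table dbo.customer_map and the staging table #resolved are hypothetical names that do not appear in the question; #resolved stands in for the output of whatever matching logic is eventually chosen.

```sql
-- Hypothetical mapping table: one row per application record (CID),
-- pointing at the surviving person id (NID).
CREATE TABLE dbo.customer_map
(
    CID int NOT NULL PRIMARY KEY,
    NID int NOT NULL
);

-- Hypothetical staging table, filled by the chosen matching logic.
CREATE TABLE #resolved
(
    CID int NOT NULL PRIMARY KEY,
    NID int NOT NULL
);

INSERT INTO #resolved (CID, NID)
VALUES (84015283, 84015283),
       (84069198, 84015283),
       (84070263, 84015283);

-- Upsert: insert records not seen before and repoint existing
-- records whose group has changed.
MERGE dbo.customer_map AS tgt
USING #resolved AS src
    ON tgt.CID = src.CID
WHEN MATCHED AND tgt.NID <> src.NID THEN
    UPDATE SET tgt.NID = src.NID
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CID, NID) VALUES (src.CID, src.NID);
```

MERGE only handles loading the result; the grouping itself still has to be computed first. Because a new record can link two previously separate groups, the matching would probably need to be re-run for the affected name/DOB combinations before each load.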

I have listed sample code below in case anyone can help.

IF OBJECT_ID('dbo.customer_dist', 'U') IS NOT NULL
    DROP TABLE dbo.customer_dist

CREATE TABLE [dbo].customer_dist
(
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL
)

INSERT INTO customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com')

SELECT * FROM customer_dist

Here are the expected results. Apologies, I should have made clear at the end what I wanted.

Output table results

    NID         id          First_Name   Last_Name  DoB         post_code  mobile      Email
    84015283    84015283    Christopher  Higg       1/13/1956   CH2 3AZ    7089559829  CH@hotmail.com
    84015283    84069198    Christopher  Higg       1/13/1956   CH2 3AZ    7089559829  CH@hotmail.com
    84015283    84070263    Christopher  Higg       1/13/1956   CH2 3AZ    7089559822  CHigg@AOL.com
    84015283    84369603    Christopher  Higg       1/13/1956   CH2 3ZA    7089559829  Higg@emailme.com
    84015283    85061159    CHRISTOPHER  Higg       1/13/1956   CH2 3RA    7089559829  CH@hotmail.com
    78065122    87065122    Matthew      Davis      05/10/1978  CH5 1TS    7077084692  Matt@gamil.com

Apologies for the slow response.

I have updated my required output: I was asked to add an extra record that did not match any other record, but I had not included it in my required output.

HABO's response was the closest. Unfortunately, on further testing with other sample data, duplicates were created and the logic broke down. Other sample data would be:-

9 Answers:

Answer 0: (score: 0)

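This is not an answer, but a comment too long to fit in the comments section.

Since the "equality" condition is complex, I think I would do it in stages: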

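  1. Create "buckets" of similar customers. A bucket identifies all customers that share the same first_name, last_name and dob. Add an index on the new "key" column to speed up grouping. A bucket may contain one or more real customers.

    -- Build one key per (first_name, last_name, dob). The id is deliberately
    -- left out of the key; including it would make every key unique.
    select
        lower(first_name) + '|' +
        lower(last_name) + '|' +
        convert(varchar(10), dob, 23) as k,
        id, post_code, mobile, email
        into bucket
      from customer_dist;
    
    create index ix1 on bucket(k);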
    
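  2. Work on each bucket and separate the customers within it. Most likely there is only one, but there can be several.

Here you need to run some iterative algorithm that compares rows, marks them as belonging to the same group or to different groups, and finally merges groups into a single group. All of this is possible, but I'm afraid I can't see how to do it in SQL alone.

You will need to do some coding here.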
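For what it is worth, the iterative step can be sketched in T-SQL with a loop. This is only one possible approach (label propagation: every row starts as its own group, and each pass pulls a row's label down to the smallest label among the rows it directly matches, until nothing changes). The working table #grp and the variable @changed are hypothetical names, and the sketch assumes customer_dist.id is unique.

```sql
-- Sketch only: #grp is a hypothetical working table.
SELECT id, First_Name, Last_Name, DoB, post_code, mobile, Email,
       id AS NID                        -- every row starts as its own group
INTO   #grp
FROM   customer_dist;

DECLARE @changed int = 1;

WHILE @changed > 0
BEGIN
    -- Lower each row's label to the smallest label among its direct matches.
    UPDATE g
    SET    g.NID = m.min_nid
    FROM   #grp AS g
    CROSS APPLY (
        SELECT MIN(o.NID) AS min_nid
        FROM   #grp AS o
        WHERE  o.First_Name = g.First_Name
          AND  o.Last_Name  = g.Last_Name
          AND  o.DoB        = g.DoB
          AND (o.post_code = g.post_code
               OR o.mobile = g.mobile
               OR o.Email  = g.Email)
    ) AS m
    WHERE  m.min_nid < g.NID;

    SET @changed = @@ROWCOUNT;          -- stop once a pass changes nothing
END;

SELECT NID, id
FROM   #grp
ORDER BY NID, id;
```

Labels only ever decrease, so the loop terminates, and chains of matches (A matches B on email, B matches C on mobile) end up under the same NID. With a case-insensitive collation, the question's sample data should come out as one group under 84015283 plus Matthew on his own. For millions of rows, an index on (First_Name, Last_Name, DoB) on the working table would likely be needed.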

Answer 1: (score: 0)

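The example below uses a CTE to pair up rows (by joining the table to itself) that have matching column values, per the requirements. In each pair the "left" row precedes the "right" row in Id order, which avoids duplicate results that differ only in having the Id values swapped.

The results of the CTE are then combined with one extra row for each group of matched rows, supplying the odd extra row that matches itself, i.e. NId = Id.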

-- Sample data.
declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );

INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com');

SELECT * FROM @customer_dist;

-- Process the data.
with PairedRows as (
  -- Pairs of rows where the "left" row precedes the "right" in   Id   order and the rows match per the stated requirements.
  select CDL.id as NId, CDR.id as Id
    from @customer_dist as CDL inner join
      @customer_dist as CDR on
        -- Pair rows where the "left" row precedes the "right" in   Id   order.
        CDR.Id > CDL.Id and
        -- "Must match" columns.
        CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and
        -- Plus at least one optional match.
        ( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email )
    -- Where there is not a prior row (in   Id   order) that matches the "left" row.
    where not exists (
      select 42 from @customer_dist as NE where NE.ID < CDL.Id and 
        NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
        ( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) )
  select NId, Id -- The paired rows.
    from PairedRows
  union all
  -- Add the   NId   row as a match to itself for every group of paired rows.
  select Min( NId ) as NID, Min( NId ) as Id
    from PairedRows
    group by NId
  order by NID, Id;

Chasing the dancing question, continued.

The following adds anyone who is unpaired to the output by means of another union all:

-- Process the data.
with PairedRows as ( -- Pairs of rows where the "left" row precedes the "right" in   Id   order and the rows match per the stated requirements.
  select CDL.id as NId, CDR.id as Id
    from @customer_dist as CDL inner join
      @customer_dist as CDR on CDR.Id > CDL.Id and -- Pair rows where the "left" row precedes the "right" in   Id   order.
        CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and -- "Must match" columns.
        ( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email ) -- Plus at least one optional match.
    where not exists ( -- Where there is not a ...
      select 42 from @customer_dist as NE where NE.ID < CDL.Id and -- ... prior row (in   Id   order) that matches the "left" row.
        NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
        ( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) )
  select NId, Id -- The paired rows.
    from PairedRows
  union all
  select Min( NId ) as NID, Min( NId ) as Id -- Add the   NId   row as a match to itself for every group of paired rows.
    from PairedRows
    group by NId
  union all
  select id, id -- Toss in anyone we haven't heard of.
    from @customer_dist as CD
    where not exists ( select 42 from PairedRows as PR where PR.NId = CD.id or PR.Id = CD.id )
  order by NID, Id;

One more mashup to show the reason for each output row:

-- Sample data.
declare @customer_dist as Table (
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL );

INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('32006455', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07706212920',  'nastie220@yahoo.com'),
       ('35963960', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863324',  'nastie@hotmail.com'),
       ('38627975', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  'nastie2001@yahoo.com'),
       ('46653041', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07483888179',  'nastie2010@yahoo.com'),
       ('48023677', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07483888179',  'nastie@hotmail.com'),
       ('49560434', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  'nastie@hotmail.com'),
       ('49861032', 'Mary', 'WILSON',   '1983-09-20',   'BT62JA',   '07849727199',  'nastie2001@yahoo.com'),
       ('53130969', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07849727199',  'Nastie@hotmail.cm'),
       ('33843283', 'Mary', 'Wilson',   '1983-09-20',   'BT148HU',  '07484863478',  'nastie2010@yahoo.co.uk'),
       -- NB: Unique   Id   in the following row.
       ('386279750', 'Mary', 'Wilson',   '1983-09-20',   'BT62JA',   '07484863478',  'nastie2001@yahoo.com');

INSERT INTO @customer_dist (id, First_Name, Last_Name, DoB, post_code, mobile, Email)
VALUES ('84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
       ('84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com'),
       ('84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com'),
       ('84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com'),
       ('85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com'),
       ('87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com');

SELECT * FROM @customer_dist;
select ( select Count(*) from @customer_dist ) as TotalRows, ( select Count( distinct id ) from @customer_dist ) as DistinctIds;

-- Process the data.
with PairedRows as ( -- Pairs of rows where the "left" row precedes the "right" in   Id   order and the rows match per the stated requirements.
  select CDL.id as NId, CDR.id as Id
    from @customer_dist as CDL inner join
      @customer_dist as CDR on CDR.Id > CDL.Id and -- Pair rows where the "left" row precedes the "right" in   Id   order.
        CDR.First_Name = CDL.First_Name and CDR.Last_Name = CDL.Last_Name and CDR.DoB = CDL.DoB and -- "Must match" columns.
        ( CDR.post_code = CDL.post_code or CDR.mobile = CDL.mobile or CDR.Email = CDL.Email ) -- Plus at least one optional match.
    where not exists ( -- Where there is not a ...
      select 42 from @customer_dist as NE where NE.ID < CDL.Id and -- ... prior row (in   Id   order) that matches the "left" row.
        NE.First_Name = CDL.First_Name and NE.Last_Name = CDL.Last_Name and NE.DoB = CDL.DoB and
        ( NE.post_code = CDL.post_code or NE.mobile = CDL.mobile or NE.Email = CDL.Email ) ) ),
  Results as (
    select NId, Id, 'Paired' as Reason -- The paired rows.
      from PairedRows
    union all
    select Min( NId ) as NID, Min( NId ) as Id, 'Self' -- Add the   NId   row as a match to itself for every group of paired rows.
      from PairedRows
      group by NId
    union all
    select id, id, 'Other' -- Toss in anyone we haven't heard of.
      from @customer_dist as CD
      where not exists ( select 42 from PairedRows as PR where PR.NId = CD.id or PR.Id = CD.id ) )
  select R.NId, R.Id, R.Reason,
    CDL.First_Name, CDL.Last_Name,
    case when CDL.DoB = CDR.DoB then '=' else '' end as MatchDoB, -- Must match.
    case when CDL.post_code = CDR.post_code then '=' else '' end as MatchPostCode,
    case when CDL.mobile = CDR.mobile then '=' else '' end as MatchMobile,
    case when CDL.Email = CDR.Email then '=' else '' end as MatchEmail,
    case when CDL.id = CDR.id then '==' else '' end as MatchSelf,
    case when ( select Count(*) from Results as IR where IR.NId = R.NId and IR.Id = R.Id ) > 1 then '#' else '' end as 'Duplicate'
    from Results as R inner join
      @customer_dist as CDL on CDL.id = R.NId inner join
      @customer_dist as CDR on CDR.id = R.Id
    order by NID, Id;

Answer 2: (score: 0)

Try this (the necessary comments are in the code):

;with cte as (
    SELECT 1 n, 84015283 CID, * FROM @tbl
    where id = 84015283
    union all 
    select c.n + 1, 84015283, t.* from cte c
    join @tbl t on
        c.First_Name = t.first_name and
        c.Last_Name = t.Last_name and
        c.DoB = t.DoB and (
        c.post_code = t.post_code or
        c.mobile = t.mobile or
        c.Email = t.Email 
        ) and
        --there is no way of writing stop condition here,
        --as joining will return in some rows every time,
        --so you have to enter here number big enough for
        --query to join all records, here 1 suffices
        --(if you enter bigger number, result will stay the same
        --due to distinct in select)
        c.n <= 1
)

select distinct CID, 
                id NID, 
                First_Name, 
                Last_Name, 
                DoB, 
                post_code, 
                mobile, 
                Email 
from cte

Another approach is to use a while loop:

declare @tempTable table
(
    [id] [int] NOT NULL,
    [First_Name] [varchar](50) NULL,
    [Last_Name] [varchar](50) NULL,
    [DoB] [date] NULL,
    [post_code] [varchar](50) NULL,
    [mobile] [varchar](50) NULL,
    [Email] [varchar](100) NULL
);
insert into @tempTable
select *
from @customer_dist

declare @inserted int = -1;
while @inserted <> (select count(*) from @tempTable)
begin
    select @inserted = count(*) from @tempTable
    insert into @tempTable
    select c.* from @customer_dist c
    where exists(select 1 from @tempTable t
                 where c.First_Name = t.first_name and
                       c.Last_Name = t.Last_name and
                       c.DoB = t.DoB and (
                       c.post_code = t.post_code or
                       c.mobile = t.mobile or
                       c.Email = t.Email 
                       )
                 )
    except
    select * from @tempTable
end

select MAX(NID) over (partition by first_name,last_name) NID,
       id, First_Name, Last_Name, DoB, post_code, mobile, Email
from (
    select (case when ROW_NUMBER() over (partition by first_name,last_name order by (select null)) = 1 then 1 else 0 end) * id NID,
           *
    from @tempTable
) a

select * from @tempTable

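It loops for as long as new records are being added to @tempTable. With your sample data, it will loop only once.

The difference from the previous query is that, thanks to except, only new records are added at each step of the loop, which is not possible in the CTE.

It also performs better, because it uses exists to determine which rows still need to be added. That is not allowed in the CTE, because the CTE cannot appear in a subquery.

Most importantly, it guarantees that you will not lose any records! In the CTE you have to limit the recursion with c.n <= 1, which can lose records.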

Answer 3: (score: 0)

[dbo].[LEVENSHTEIN]

CREATE FUNCTION [dbo].[LEVENSHTEIN](@left  VARCHAR(100),
                                @right VARCHAR(100))
RETURNS INT
AS
  BEGIN
      DECLARE @difference    INT,
              @lenRight      INT,
              @lenLeft       INT,
              @leftIndex     INT,
              @rightIndex    INT,
              @left_char     CHAR(1),
              @right_char    CHAR(1),
              @compareLength INT

      SET @lenLeft = LEN(@left)
      SET @lenRight = LEN(@right)
      SET @difference = 0

      IF @lenLeft = 0
        BEGIN
            SET @difference = @lenRight

            GOTO done
        END

      IF @lenRight = 0
        BEGIN
            SET @difference = @lenLeft

            GOTO done
        END

      GOTO comparison

      COMPARISON:

      IF ( @lenLeft >= @lenRight )
        SET @compareLength = @lenLeft
      ELSE
        SET @compareLength = @lenRight

      SET @rightIndex = 1
      SET @leftIndex = 1

      WHILE @leftIndex <= @compareLength
        BEGIN
            SET @left_char = SUBSTRING(@left, @leftIndex, 1)
            SET @right_char = SUBSTRING(@right, @rightIndex, 1)

            IF @left_char <> @right_char
              BEGIN -- Would an insertion make them re-align?
                  IF( @left_char = SUBSTRING(@right, @rightIndex + 1, 1) )
                    SET @rightIndex = @rightIndex + 1
                  -- Would an deletion make them re-align?
                  ELSE IF( SUBSTRING(@left, @leftIndex + 1, 1) = @right_char )
                    SET @leftIndex = @leftIndex + 1

                  SET @difference = @difference + 1
              END

            SET @leftIndex = @leftIndex + 1
            SET @rightIndex = @rightIndex + 1
        END

      GOTO done

      DONE:

          RETURN @difference
      END

    GO

[dbo].[GetPercentageOfTwoStringMatching]

CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
    @string1 NVARCHAR(100)
    ,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN

    DECLARE @levenShteinNumber INT

    DECLARE @string1Length INT = LEN(@string1)
    , @string2Length INT = LEN(@string2)
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END

    SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] (   @string1  ,@string2)

    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters

    -- Return the result of the function
    RETURN @percentageOfGoodCharacters
END
GO
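
As a quick usage sketch (assuming both functions above have been created in the current database and that the question's customer_dist table exists), the percentage function can be called on pairs of rows that share the "must match" columns. Note that LEVENSHTEIN as written appears to be a single-pass approximation and not the full dynamic-programming edit distance, so the percentages are best treated as a rough similarity score.

```sql
-- Compare every pair of rows with the same name and date of birth.
SELECT a.id AS left_id,
       b.id AS right_id,
       dbo.GetPercentageOfTwoStringMatching(a.post_code, b.post_code) AS post_code_match,
       dbo.GetPercentageOfTwoStringMatching(a.mobile, b.mobile)       AS mobile_match,
       dbo.GetPercentageOfTwoStringMatching(a.Email, b.Email)         AS email_match
FROM   customer_dist AS a
JOIN   customer_dist AS b
       ON  b.id > a.id                  -- each pair once, in id order
       AND b.First_Name = a.First_Name
       AND b.Last_Name  = a.Last_Name
       AND b.DoB        = a.DoB
ORDER BY a.id, b.id;
```

One thing to watch: if both inputs are empty strings, the maximum length is 0 and the percentage calculation would divide by zero, so empty values may need to be filtered out first.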

Query

    DECLARE @customer_dist TABLE
    (
        [id] [INT] NOT NULL ,
        [First_Name] [VARCHAR](50) NULL ,
        [Last_Name] [VARCHAR](50) NULL ,
        [DoB] [DATE] NULL ,
        [post_code] [VARCHAR](50) NULL ,
        [mobile] [VARCHAR](50) NULL ,
        [Email] [VARCHAR](100) NULL
    );

INSERT INTO @customer_dist ( id ,
                             First_Name ,
                             Last_Name ,
                             DoB ,
                             post_code ,
                             mobile ,
                             Email )
VALUES ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , 'CH@hotmail.com' ) ,
       ( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , 'CH@hotmail.com' ) ,
       ( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559822' , 'CHigg@AOL.com' ) ,
       ( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA' ,
         '07089559829' , 'Higg@emailme.com' ) ,
       ( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA' ,
         '07089559829' , 'CH@hotmail.com' ) ,
       ( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS' ,
         '07077084692' , 'Matt@gamil.com' ) ,
       ( '94015281', 'Christopher', 'Higg', '1956-01-13', 'NN2 1XH' ,
         '08009777337' , 'CHigg@gmail.com' );



SELECT result.* ,
       [dbo].GetPercentageOfTwoStringMatching(result.DoB, d.DoB) [DOB%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.post_code, d.post_code) [post_code%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.mobile, d.mobile) [mobile%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.Email, d.Email) [email%match]
FROM   (   SELECT (   SELECT MIN(id)
                      FROM   @customer_dist AS sq
                      WHERE  sq.First_Name = cd.First_Name
                             AND sq.Last_Name = cd.Last_Name
                             AND (   sq.mobile = cd.mobile
                                     OR sq.Email = cd.Email
                                     OR sq.post_code = cd.post_code )) nid ,
                  *
           FROM   @customer_dist AS cd ) AS result
       INNER JOIN @customer_dist d ON result.nid = d.id;

Result: [image]

Second query

    DECLARE @customer_dist TABLE
    (
        [id] [INT] NOT NULL ,
        [First_Name] [VARCHAR](50) NULL ,
        [Last_Name] [VARCHAR](50) NULL ,
        [DoB] [DATE] NULL ,
        [post_code] [VARCHAR](50) NULL ,
        [mobile] [VARCHAR](50) NULL ,
        [Email] [VARCHAR](100) NULL
    );

INSERT INTO @customer_dist ( id ,
                             First_Name ,
                             Last_Name ,
                             DoB ,
                             post_code ,
                             mobile ,
                             Email )
VALUES ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , 'CH@hotmail.com' ) ,
       ( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559829' , 'CH@hotmail.com' ) ,
       ( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ' ,
         '07089559822' , 'CHigg@AOL.com' ) ,
       ( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA' ,
         '07089559829' , 'Higg@emailme.com' ) ,
       ( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA' ,
         '07089559829' , 'CH@hotmail.com' ) ,
       ( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS' ,
         '07077084692' , 'Matt@gamil.com' ) ,
       ( '94015281', 'Christopher', 'Higg', '1956-01-13', 'NN2 1XH' ,
         '08009777337' , 'CHigg@gmail.com' );



SELECT result.* ,
       [dbo].GetPercentageOfTwoStringMatching(result.DoB, d.DoB) [DOB%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.post_code, d.post_code) [post_code%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.mobile, d.mobile) [mobile%match] ,
       [dbo].GetPercentageOfTwoStringMatching(result.Email, d.Email) [email%match]
FROM   (   SELECT (   SELECT MIN(id)
                      FROM   @customer_dist AS sq
                      WHERE  sq.First_Name = cd.First_Name
                             AND sq.Last_Name = cd.Last_Name
                             AND (  sq.DoB = cd.DoB   
                                     OR sq.mobile = cd.mobile
                                     OR sq.Email = cd.Email
                                     OR sq.post_code = cd.post_code )) nid ,
                  *
           FROM   @customer_dist AS cd ) AS result
       INNER JOIN @customer_dist d ON result.nid = d.id;

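Result: [image]

Answer 4: (score: 0)

Since you have mentioned that a "group" is based mainly on three columns (FirstName, LastName and DOB), you could create a view that tracks the minimum ID across all records and use that view for further processing whenever needed.

You could also create a CTE. It all depends on how you plan to use the result set.

I would not try to update the existing records in the customer_dist table, since it serves as the raw table in case you ever want to go back and see the exact data users entered at different points in time (if you are interested in statistics or data trends).

The query for both approaches: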

SELECT 
  MIN(id) AS Min_Id,
  LOWER(First_Name) AS firstName, LOWER(Last_Name) As lastName, DoB
FROM
customer_dist
GROUP BY 
LOWER(First_Name), LOWER(Last_Name), DoB;

View example

CTE example
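
Since the linked examples are not shown here, the following is a rough sketch of what the two variants might look like. The view name dbo.vw_customer_min_id and the CTE name min_ids are made up for illustration.

```sql
-- Hypothetical view wrapping the grouping query above.
CREATE VIEW dbo.vw_customer_min_id
AS
SELECT MIN(id)           AS Min_Id,
       LOWER(First_Name) AS firstName,
       LOWER(Last_Name)  AS lastName,
       DoB
FROM   dbo.customer_dist
GROUP BY LOWER(First_Name), LOWER(Last_Name), DoB;
GO

-- The same query as a CTE, joined back to list every id
-- next to the minimum id of its group.
WITH min_ids AS (
    SELECT MIN(id)           AS Min_Id,
           LOWER(First_Name) AS firstName,
           LOWER(Last_Name)  AS lastName,
           DoB
    FROM   dbo.customer_dist
    GROUP BY LOWER(First_Name), LOWER(Last_Name), DoB
)
SELECT m.Min_Id AS NID,
       c.id
FROM   dbo.customer_dist AS c
JOIN   min_ids AS m
       ON  m.firstName = LOWER(c.First_Name)
       AND m.lastName  = LOWER(c.Last_Name)
       AND m.DoB       = c.DoB
ORDER BY NID, c.id;
```

Keep in mind that this groups on name and date of birth only. It does not apply the question's extra rule that at least one of post_code, mobile or Email must also match, so two different people with the same name and DOB would be merged.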

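Answer 5: (score: 0)

If you use UNION it will be a heavy operation, but it does remove duplicate rows.

In addition, I strongly recommend that you use the "fuzzy logic" in SSIS. It is a more effective way of identifying near-duplicates. This is just an example I found on YouTube that may point you in the right direction. I hope this helps.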

https://www.youtube.com/watch?v=eVOmXssmB7I

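Answer 6: (score: 0)

I used to work for a very old-school insurance company that had a similar problem with its data.

My main effort here is to narrow the result set down to the records that contain duplicates, and so find what ties the duplicates together. Once you have that, the rest of the solution comes quickly.

The logic is: join the table to itself on the columns that always share the same values (Fname, Lname, DOB) and on the columns that occasionally share the same values (post_code, mobile, email). More importantly, the ids must not match, which ensures that non-duplicate records are excluded and only the duplicates are kept.

Once you have only the duplicates, find the MIN(id), put it in a CTE, join to the original table, and you are done. Non-duplicate records do not need a min-id, because their own id is the min-id.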

;WITH DUPS AS
(
    SELECT DISTINCT
        MIN(C1.ID) OVER (PARTITION BY C1.First_Name, C1.Last_Name, C1.DoB) AS minid,
        C1.id, C1.First_Name, C1.Last_Name, C1.DoB
    FROM customer_dist c1
    INNER JOIN customer_dist c2
        ON  c1.First_Name = c2.First_Name
        AND c1.Last_Name = c2.Last_Name
        AND c1.DoB = c2.DoB
        AND (c1.post_code = c2.post_code OR c1.mobile = c2.mobile OR c1.Email = c2.Email)
        AND C1.ID <> C2.ID
)

SELECT ISNULL(D.minid, C.ID) AS NID,
       C.*
FROM customer_dist C
LEFT JOIN DUPS D ON C.id = D.id

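Answer 7: (score: 0)

Perhaps the most elegant solution is to use OVER (PARTITION BY ...) to match them up. Normally, if all of your conditions could be ANDed together, this would be simple. Because you need some OR logic across the post_code, mobile and email columns, a few extra steps are required.

First, find the MIN() id for each of the three ways of matching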

    SELECT
        *,
        NID_post_code   = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
        NID_mobile      = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
        NID_email       = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
    FROM    @customer_dist

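Now you have a result set that shows, for each ID, the lowest matching ID according to three different criteria: [image: Lowest ID match on three different criteria]

We know that the lowest matching ID across these three criteria is the one we want...

Wrap your query with a bit of cross-apply-style magic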

SELECT
    NID = (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    ),
    cd.*
FROM    (
    SELECT
        *,
        NID_post_code   = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
        NID_mobile      = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
        NID_email       = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
    FROM    @customer_dist
) AS cd
order BY (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    )

The results look like this: [image]

You can use these results to create a lookup/xref table, or you can add an NID column to the original table and merge these results into it.
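
As a sketch of the second option, and assuming the data lives in a permanent table dbo.customer_dist with a unique id (the examples here use a table variable, which cannot be altered this way), the NID column could be added and filled like this:

```sql
-- Add the new column; GO ends the batch so that the UPDATE
-- below can see the column when it is compiled.
ALTER TABLE dbo.customer_dist ADD NID int NULL;
GO

UPDATE c
SET    c.NID = m.NID
FROM   dbo.customer_dist AS c
JOIN  (
    SELECT cd.id,
           -- Smallest of the three per-criterion minimum ids.
           NID = (SELECT MIN(v)
                  FROM (VALUES (cd.NID_post_code),
                               (cd.NID_mobile),
                               (cd.NID_email)) AS X (v))
    FROM  (
        SELECT id,
               NID_post_code = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
               NID_mobile    = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
               NID_email     = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
        FROM   dbo.customer_dist
    ) AS cd
) AS m
    ON m.id = c.id;
```

This is a single pass, so it only links rows that directly share a post code, mobile or email with the lowest id. A row that is connected to the group only through a chain of other rows may need the update to be repeated until no NID changes.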

The complete query, with some extra data

DECLARE @customer_dist AS table (
    id          int             NOT NULL,
    First_Name  varchar(50)     NULL,
    Last_Name   varchar(50)     NULL,
    DoB         date            NULL,
    post_code   varchar(50)     NULL,
    mobile      varchar(50)     NULL,
    Email       varchar(100)    NULL
);


INSERT INTO @customer_dist ( id, First_Name , Last_Name, DoB, post_code, mobile, Email )
VALUES
    ( '32006455', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07706212920', 'nastie220@yahoo.com' ),
    ( '35963960', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863324', 'nastie@hotmail.com' ), 
    ( '38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', 'nastie2001@yahoo.com' ), 
    ( '46653041', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07483888179', 'nastie2010@yahoo.com' ), 
    ( '48023677', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07483888179', 'nastie@hotmail.com' ), 
    ( '49560434', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', 'nastie@hotmail.com' ), 
    ( '49861032', 'Mary', 'WILSON', '1983-09-20', 'BT62JA', '07849727199', 'nastie2001@yahoo.com' ), 
    ( '53130969', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07849727199', 'Nastie@hotmail.cm' ), 
    ( '33843283', 'Mary', 'Wilson', '1983-09-20', 'BT148HU', '07484863478', 'nastie2010@yahoo.co.uk' ), 
    ( '38627975', 'Mary', 'Wilson', '1983-09-20', 'BT62JA', '07484863478', 'nastie2001@yahoo.com' ), 
    ( '84015283', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com' ), 
    ( '84069198', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559829', 'CH@hotmail.com' ), 
    ( '84070263', 'Christopher', 'Higg', '1956-01-13', 'CH2 3AZ', '07089559822', 'CHigg@AOL.com' ), 
    ( '84369603', 'Christopher', 'Higg', '1956-01-13', 'CH2 3ZA', '07089559829', 'Higg@emailme.com' ), 
    ( '85061159', 'CHRISTOPHER', 'Higg', '1956-01-13', 'CH2 3RA', '07089559829', 'CH@hotmail.com' ), 
    ( '84369605', 'Christopher', 'Higg', '1956-01-13', 'CH2 ZZZ', '07089559999', 'chrish@gmail.com' ), 
    ( '84369677', 'Christopher', 'Higg', '1956-01-13', 'AH2 ZZZ', '09089559999', 'chrish@gmail.com' ), 
    ( '87065122', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com' ),
    ( '87065123', 'Matthew', 'Davis', '1978-05-10', 'CH5 1TS', '07077084692', 'Matt@gamil.com' )

SELECT
    NID = (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    ),
    cd.*
FROM    (
    SELECT
        *,
        NID_post_code   = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, post_code),
        NID_mobile      = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, mobile),
        NID_email       = MIN(id) OVER (PARTITION BY First_Name, Last_Name, DoB, Email)
    FROM    @customer_dist
) AS cd
order BY (
        SELECT
            MIN(NID)
        FROM ( VALUES (NID_post_code), (NID_mobile), (NID_email)) AS X (NID)
    )

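Answer 8: (score: 0)

This ultimately looks like a data ranking problem to me. With that in mind, we can use the DENSE_RANK window function to determine how our accounts should be grouped together. The example below shows how this might be done.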

--header