SQL Server: finding and grouping duplicates by "weak" criteria

Date: 2014-11-06 10:35:46

Tags: sql sql-server linq group-by duplicates

I am trying to list and group possible duplicates from a Person table.

The table looks like this:

Id    LastName      OriginalName    FirstName
---------------------------------------------
1     Nolte         Huber           Silvia
2     Nolte                         Johann
3     Huber                         Milan
4     Huber                         Silvia
5     Abacherli                     Adrian
6     Abächerli                     Adrian    
7     Meier                         Hans
8     Meier                         Urs
9     Meyer                         Hans
10    Meier                         Urs
11    Hermann                       Marco
12    Huber                         Milan
13    Meyer                         Hans    

Expected result:

GroupNumber   Id    LastName      OriginalName    FirstName 
-----------------------------------------------------------
1             5     Abacherli                     Adrian
1             6     Abächerli                     Adrian  
2             3     Huber                         Milan
2             12    Huber                         Milan
3             4     Huber                         Silvia
3             1     Nolte         Huber           Silvia
4             7     Meier                         Hans
4             9     Meyer                         Hans
4             13    Meyer                         Hans
5             8     Meier                         Urs
5             10    Meier                         Urs

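Explanation:

I want to group the matching rows and list them in a grid in a web application (ASP.NET MVC). To count as duplicates, two rows must at least have:

  • the same LastName and the same FirstName, or
  • a LastName that matches the other row's OriginalName, and the same FirstName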

To make things more complicated, "same" means a phonetic match (i.e. via SOUNDEX or a similar function): Meyer == Meier == meier
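As a quick illustration of the phonetic matching described above, here is a minimal sketch using the built-in T-SQL functions SOUNDEX and DIFFERENCE. The column aliases are made up for the example. Under the standard Soundex rules all three spellings encode to M600, so they compare as equal. Names with accented characters such as Abächerli may be encoded differently depending on collation and compatibility level, so that case is worth checking on the actual server.

```sql
-- SOUNDEX maps a string to a four-character phonetic code.
-- DIFFERENCE compares the codes of two strings and returns
-- a value from 0 (least similar) to 4 (most similar).
SELECT
    SOUNDEX('Meyer')             AS MeyerCode,
    SOUNDEX('Meier')             AS MeierCode,
    SOUNDEX('meier')             AS LowerCaseCode,
    DIFFERENCE('Meyer', 'Meier') AS Similarity;
```

Comparing SOUNDEX codes for equality gives a strict phonetic match, while DIFFERENCE allows a threshold (for example >= 3) if looser matching is wanted.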

Technologies in use:

  • Microsoft SQL Server 2012
  • Telerik DataAccess ORM
  • .NET Framework 4.5, C#

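Expected answer:

  • a pure SQL query, or
  • a stored procedure, or
  • a combination of a SQL query / SP and LINQ queries against the ORM in C#

So far I have worked out everything except the GroupNumber. Here is a (non-working) query: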

SELECT 
    Id, LastName, FirstName 
FROM 
    Person p1,
    (SELECT
     p1.Id AS Id1
     FROM Person p1
     INNER JOIN Person p2
     ON (p1.LastName LIKE p2.LastName OR p1.LastName LIKE p2.OriginalName) AND p1.FirstName LIKE p2.FirstName AND p1.Id <> p2.Id
     GROUP BY p1.Id
     HAVING COUNT(*) > 1) AS p2
WHERE 
    p1.Id IN (SELECT Id1)
ORDER BY
    p1.LastName, FirstName, Id

2 answers:

Answer 0: (score: 1)

How about this:

SQL Fiddle

MS SQL Server 2012 Schema Setup

CREATE TABLE Person
( ID Int,
  LastName Varchar(50),
  OriginalName Varchar(50),
  FirstName varchar(50)
)

INSERT INTO Person
VALUES
  (1, 'Nolte', 'Huber','Silvia'),
  (2,'Nolte', '', 'Johann'),
  (3,'Huber', '', 'Milan'),
  (4,'Huber', '', 'Silvia'),
  (5,'Abacherli', '', 'Adrian'),
  (6,'Abacherli', '', 'Adrian'),
  (7,'Meier', '', 'Hans'),
  (8,'Meier', '', 'Urs'),
  (9,'Meyer', '', 'Hans'),
  (10,'Meier', '', 'Urs'),
  (11,'Hermann', '', 'Marco'),
  (12,'Huber', '', 'Milan'),
  (13,'Meyer', '', 'Hans')

Query 1

;WITH PersonCTE
AS
(
    SELECT ID, SOUNDEX(LastName) AS LastNameSDX, LastName, OriginalName, SOUNDEX(FirstName) FirstNameSDX, FirstName
    FROM Person
    UNION ALL
    SELECT ID, SOUNDEX(OriginalName) AS LastNameSDX, LastName, OriginalName, SOUNDEX(FirstName) FirstNameSDX, FirstName
    FROM Person
    WHERE OriginalName <> ''
),
PersonRankCTE
AS
(
    SELECT DENSE_RANK() OVER (ORDER BY LastNameSDX, FirstNameSdx) AS Grp, * 
    FROM PersonCTE
)
SELECT DENSE_RANK() OVER(ORDER BY grp) AS Grp, ID, LastName, OriginalName, FirstName
FROM PersonRankCTE P1
WHERE (SELECT COUNT(*) FROM PersonRankCTE P2 WHERE P1.grp = P2.grp) > 1

Results

| GRP | ID |  LASTNAME | ORIGINALNAME | FIRSTNAME |
|-----|----|-----------|--------------|-----------|
|   1 |  5 | Abacherli |              |    Adrian |
|   1 |  6 | Abacherli |              |    Adrian |
|   2 |  3 |     Huber |              |     Milan |
|   2 | 12 |     Huber |              |     Milan |
|   3 |  1 |     Nolte |        Huber |    Silvia |
|   3 |  4 |     Huber |              |    Silvia |
|   4 | 13 |     Meyer |              |      Hans |
|   4 |  9 |     Meyer |              |      Hans |
|   4 |  7 |     Meier |              |      Hans |
|   5 |  8 |     Meier |              |       Urs |
|   5 | 10 |     Meier |              |       Urs |
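The query above keeps only groups with more than one member by running a correlated COUNT(*) subquery for every row. A possible variation, sketched here against the same Person table and sample data, counts the members of each phonetic key with a window function instead. The names PersonKeys, Counted, NameSDX and Members are made up for this sketch and are not part of the original answer.

```sql
;WITH PersonKeys AS
(
    -- One row per person, keyed on the phonetic LastName
    SELECT ID, LastName, OriginalName, FirstName,
           SOUNDEX(LastName)  AS NameSDX,
           SOUNDEX(FirstName) AS FirstNameSDX
    FROM Person
    UNION ALL
    -- A second row keyed on the phonetic OriginalName, when there is one
    -- (the <> '' test also filters out NULL values)
    SELECT ID, LastName, OriginalName, FirstName,
           SOUNDEX(OriginalName),
           SOUNDEX(FirstName)
    FROM Person
    WHERE OriginalName <> ''
),
Counted AS
(
    -- Number of rows sharing the same phonetic name key
    SELECT *,
           COUNT(*) OVER (PARTITION BY NameSDX, FirstNameSDX) AS Members
    FROM PersonKeys
)
-- DENSE_RANK runs after the WHERE filter, so the group numbers are contiguous
SELECT DENSE_RANK() OVER (ORDER BY NameSDX, FirstNameSDX) AS GroupNumber,
       ID, LastName, OriginalName, FirstName
FROM Counted
WHERE Members > 1
ORDER BY GroupNumber, ID;
```

One caveat that applies to both versions: a person whose LastName and OriginalName produce the same SOUNDEX code contributes two rows to the same key and would be reported as a group on their own.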

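Answer 1: (score: 0)

Maybe (probably?) overly complicated, but...

I made two CTEs:

1. One to get all Person fields together with the Soundex of LastName and OriginalName.

2. One to create the groups and get the GroupNumber. It does a UNION ALL of the "soundexed" LastName and OriginalName into a single "column" (keeping only the duplicates).

So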

with cte as (select 
                   id, 
                   LastName, 
                   OriginalName, 
                   soundex(LastName) as sdxLastName, 
                   -- NULLIF: treat an empty OriginalName like NULL so it yields no phonetic key
                   soundex(nullif(OriginalName, '')) as sdxOriginalName, 
                   FirstName 
             from Person),

     grp as (select lname, FirstName, row_number() over(order by lname) rn  
             from (
                  select 
                    sdxOriginalName as lname, 
                    FirstName from cte
                  where sdxOriginalName is not null
                  union all 
                  select 
                      sdxLastName as lname, 
                      FirstName from cte) s
                group by lname, FirstName
              having count(*) > 1)
select 
    g.rn as GroupNumber,
    p.Id,
    p.LastName,
    p.OriginalName,
    p.FirstName
from grp g
join cte p on p.firstName = g.FirstName and 
    (sdxLastName = g.lname or sdxOriginalName = g.lname)
order by rn

See Sqlfiddle