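How to find duplicates in SQL on a case- and accent-insensitive combination of 2 columns

Date: 2018-05-31 11:57:13

Tags: sql sql-server

The table contains information from 2 input streams, and a user may appear in both streams in slightly different forms. I am trying to find these duplicate users. I came up with this SQL statement, which finds most of them: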

SELECT s.PROF_MAIL, s.PROF_STATE, s.PROF_GUID, CONCAT(s.PROF_GIVEN_NAME,' ',s.PROF_SURNAME) AS FullName, t.*
FROM [EMPLOYEE] s
join (
    SELECT PROF_GIVEN_NAME,PROF_SURNAME, count(*) as qty
      FROM [EMPLOYEE] 
      GROUP BY PROF_GIVEN_NAME,PROF_SURNAME 
      HAVING count(*) > 1
    ) t on s.PROF_GIVEN_NAME = t.PROF_GIVEN_NAME AND s.PROF_SURNAME = t.PROF_SURNAME

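The problem is that a name can have an accent in one source, such as René, but not in the other. Capitalization is not necessarily the same either. The statement above does not catch these cases. I tried to work COLLATE Latin1_General_CI_AI in somewhere, but could not figure out where to use it or how to solve this another way. Does anyone know how to do this? The database is MS SQL.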

2 answers:

Answer 0: (score: 0)

First, you should use a window function:

select e.*
from (select e.*,
             count(*) over (partition by prof_given_name, prof_surname) as cnt
      from [EMPLOYEE] e
     ) e
where cnt > 1;

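You can now add the collate clause to the partition by, so that names differing only in case or accents fall into the same partition: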

select e.*
from (select e.*,
             count(*) over (partition by prof_given_name collate Latin1_General_CI_AI, prof_surname collate Latin1_General_CI_AI) as cnt
      from [EMPLOYEE] e
     ) e
where cnt > 1;
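
For comparison, the same collation can also be applied to the GROUP BY / JOIN query from the question. This is only a sketch and has not been run against the asker's data: it assumes the table and column names shown in the question, and the aliases GivenName and Surname are introduced here purely for illustration. The collation has to appear both in the grouping and in the join condition, otherwise rows such as "René" and "rene" are counted together but not joined back.

```sql
-- Sketch: the question's query with a case- and accent-insensitive
-- collation applied to the grouping and to the join condition.
-- GivenName / Surname are illustrative aliases, not existing columns.
SELECT s.PROF_MAIL, s.PROF_STATE, s.PROF_GUID,
       CONCAT(s.PROF_GIVEN_NAME, ' ', s.PROF_SURNAME) AS FullName,
       t.qty
FROM [EMPLOYEE] s
JOIN (
    SELECT PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI AS GivenName,
           PROF_SURNAME    COLLATE Latin1_General_CI_AI AS Surname,
           COUNT(*) AS qty
    FROM [EMPLOYEE]
    GROUP BY PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI,
             PROF_SURNAME    COLLATE Latin1_General_CI_AI
    HAVING COUNT(*) > 1
) t ON  s.PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI = t.GivenName
    AND s.PROF_SURNAME    COLLATE Latin1_General_CI_AI = t.Surname;
```

The window-function version above is shorter because it needs the collation in only one place.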

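Answer 1: (score: 0)

You can use the ROW_NUMBER window function with the name columns in the PARTITION BY, likewise with COLLATE applied. Note that filtering on RN > 1 returns only the extra rows in each group, not the first one: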

;WITH cteDups
AS(
    SELECT
        *,RN=ROW_NUMBER()OVER(PARTITION BY 
                                  PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI, 
                                  PROF_SURNAME COLLATE Latin1_General_CI_AI 
                        ORDER BY PROF_SURNAME ASC )
    FROM    dbo.Employee
)
SELECT * FROM cteDups WHERE cteDups.RN > 1

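If the EMPLOYEE table has a DATETIME column that records when each row was created, replace the ORDER BY with that column so that you can identify the most recent record.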
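
A minimal sketch of that variant, assuming a hypothetical CREATED_AT column (the question does not say what such a column is called, or whether one exists). Ordering it descending gives the newest row in each group RN = 1, so the filter returns the older duplicates:

```sql
-- Sketch only: CREATED_AT is a hypothetical DATETIME column.
;WITH cteDups AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (
                    PARTITION BY PROF_GIVEN_NAME COLLATE Latin1_General_CI_AI,
                                 PROF_SURNAME    COLLATE Latin1_General_CI_AI
                    ORDER BY CREATED_AT DESC)  -- newest row gets RN = 1
    FROM dbo.Employee
)
SELECT * FROM cteDups WHERE RN > 1;  -- older rows of each duplicate group
```

Switching the ORDER BY to ASC would keep the oldest row as RN = 1 and return the newer ones.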