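SQL query to find potential duplicate records

Time: 2017-05-17 19:05:23

Tags: mysql sql

I am working on employer data, trying to find duplicate employers based on their names.

The data looks like this: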

Employer ID   |   Legal Name   |    Operating Name
------------- | ---------------| --------------------
1             |      AA        |        AA
2             |      BB        |        AA
3             |      CC        |        BB
4             |      DD        |        DD
5             |      ZZ        |        ZZ

Now, if I look for all duplicates of employer AA, the query should return the following result:

Employer ID   |   Legal Name   |    Operating Name
------------- | ---------------| --------------------
1             |      AA        |        AA
2             |      BB        |        AA
3             |      CC        |        BB

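The legal name of employer 1 and the operating name of employer 2 match the search directly. The catch is employer 3: it has no direct relation to the search string, but the legal name of employer 2 matches the operating name of employer 3.

I need the search to follow these matches to the nth level. I am not sure whether this can be achieved with something like a recursive query.

Please help.

I tried to achieve this with a recursive CTE, but then realized that it goes into infinite recursion. Here is the code: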

    DECLARE @SearchName VARCHAR(50)
    SET @SearchName = 'AA'
    ;With CTE_EmployerNames
    AS
    (
        -- Anchor Member definition
        select  * 
        from    [dbo].[Name_Table]
        where   Leg_Name = @SearchName 
        OR      Op_Name = @SearchName 
        UNION ALL
        -- Recursive Member definition
        select  N.*
        from    [dbo].[Name_Table] N
        JOIN    CTE_EmployerNames  C
        ON      N.ID <> C.ID
        AND     (N.Leg_Name = C.Leg_Name
        OR      N.Leg_Name = C.Op_Name
        OR      N.Op_Name = C.Leg_Name
        OR      N.Op_Name = C.Op_Name)
    )

    select  * 
    from    CTE_EmployerNames
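
The recursion does not end because the recursive member keeps no record of the IDs it has already returned: employers 1 and 2 match each other through AA, so each keeps re-adding the other. As a hedged illustration only, the sketch below runs the same query shape on SQLite through Python's `sqlite3` module (both are assumptions; the question itself uses SQL Server syntax). SQLite accepts `UNION` in a recursive CTE and discards rows it has already produced, which lets the recursion stop. SQL Server requires `UNION ALL` in a recursive CTE, so this does not carry over directly. Table and column names follow the question; the function name is hypothetical.

```python
import sqlite3

# Sample rows from the question: (ID, Leg_Name, Op_Name)
ROWS = [
    (1, "AA", "AA"),
    (2, "BB", "AA"),
    (3, "CC", "BB"),
    (4, "DD", "DD"),
    (5, "ZZ", "ZZ"),
]

QUERY = """
WITH RECURSIVE CTE_EmployerNames(ID, Leg_Name, Op_Name) AS (
    -- Anchor member: rows that match the search name directly
    SELECT ID, Leg_Name, Op_Name
    FROM   Name_Table
    WHERE  Leg_Name = :name OR Op_Name = :name
    UNION  -- UNION (not UNION ALL) drops rows that were already produced
    -- Recursive member: rows sharing any name with a row found so far
    SELECT N.ID, N.Leg_Name, N.Op_Name
    FROM   Name_Table N
    JOIN   CTE_EmployerNames C
      ON   N.ID <> C.ID
     AND  (N.Leg_Name IN (C.Leg_Name, C.Op_Name)
       OR  N.Op_Name  IN (C.Leg_Name, C.Op_Name))
)
SELECT ID, Leg_Name, Op_Name FROM CTE_EmployerNames ORDER BY ID
"""


def find_duplicates_cte(search_name):
    """Return all rows linked to search_name, directly or through other rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Name_Table "
            "(ID INTEGER PRIMARY KEY, Leg_Name TEXT, Op_Name TEXT)"
        )
        conn.executemany("INSERT INTO Name_Table VALUES (?, ?, ?)", ROWS)
        return conn.execute(QUERY, {"name": search_name}).fetchall()
    finally:
        conn.close()


print(find_duplicates_cte("AA"))
```

For the sample data this returns employers 1, 2 and 3, and the query finishes because every row can enter the result at most once.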

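Update: I created a stored procedure that does what I want. Because of the loop and the cursor, however, the procedure is somewhat slow. For now it solves my problem at a small cost in execution time. Any suggestions for optimizing it, or for doing this another way, would be greatly appreciated. Thanks, everyone. Here is the code: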

CREATE PROCEDURE [dbo].[Get_Similar_Name_Employers] 
@P_BaseName VARCHAR(100)
AS
BEGIN
DECLARE @ID INT
DECLARE @Leg_Name VARCHAR(50)
DECLARE @Op_Name VARCHAR(50)

-- Create temp table to hold data temporarily
CREATE TABLE #Temp_Employers
(
    [ID] [int] NULL,
    [Leg_Name] [varchar](50) NULL,
    [Op_Name] [varchar](50) NULL,
    [Status] [bit] null -- To keep track if that record is processed or not
)

-- Insert all records which are directly matching with search criteria
INSERT INTO #Temp_Employers
SELECT  NT.ID, NT.Leg_Name, NT.Op_Name, 0
FROM    dbo.Name_Table NT
WHERE   NT.Leg_Name = @P_BaseName 
OR      NT.Op_Name = @P_BaseName 

while EXISTS (SELECT 1 from #Temp_Employers where Status = 0) -- until all rows are processed
BEGIN
    DECLARE @EmployerCursor CURSOR
    SET     @EmployerCursor = CURSOR FAST_FORWARD
    FOR
            SELECT  ID, Leg_Name, Op_Name  
            from    #Temp_Employers 
            where   Status = 0

    OPEN    @EmployerCursor

    FETCH   NEXT 
    FROM    @EmployerCursor
    INTO    @ID, @Leg_Name, @Op_Name

    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- For every unprocessed record in temp table check if there is any possible duplicate.
        -- and insert all possible duplicate records in same table for further processing to find their possible duplicates     
        INSERT  INTO #Temp_Employers
        select  ID, Leg_Name, Op_Name, 0 
        from    dbo.Name_Table 
        WHERE   (Leg_Name = @Leg_Name 
        OR      Op_Name = @Op_Name 
        OR      Leg_Name = @Op_Name 
        OR      Op_Name = @Leg_Name)
        AND     ID NOT IN ( select  ID 
                            FROM    #Temp_Employers) 

        -- Update status of recently processed record to avoid processing again
        UPDATE  #Temp_Employers
        SET     Status = 1
        WHERE   ID = @ID

        FETCH   NEXT 
        FROM    @EmployerCursor
        INTO    @ID, @Leg_Name, @Op_Name
    END

    -- close cursor and deallocate memory
    CLOSE @EmployerCursor
    DEALLOCATE @EmployerCursor
END

select  ID,
        Leg_Name,
        Op_Name 
from    #Temp_Employers 
Order By ID

DROP TABLE #Temp_Employers 

END
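
One possible way to avoid the cursor is to expand the result one whole level at a time: insert every row that shares a name with any row already found, and stop once a pass adds nothing. The sketch below shows that loop with Python's `sqlite3` module and an in-memory database, an assumption made only so the example can run; the `found` table and the function name are hypothetical. In T-SQL the same shape would presumably be a `WHILE` loop around a single `INSERT ... SELECT`, which has not been tested here.

```python
import sqlite3

# Sample rows from the question: (ID, Leg_Name, Op_Name)
ROWS = [
    (1, "AA", "AA"),
    (2, "BB", "AA"),
    (3, "CC", "BB"),
    (4, "DD", "DD"),
    (5, "ZZ", "ZZ"),
]

# One set-based pass: add every not-yet-found row that shares a name
# with any row already in the found table.
EXPAND_SQL = """
INSERT INTO found
SELECT N.ID, N.Leg_Name, N.Op_Name
FROM   Name_Table N
WHERE  N.ID NOT IN (SELECT ID FROM found)
AND    EXISTS (
           SELECT 1
           FROM   found F
           WHERE  N.Leg_Name IN (F.Leg_Name, F.Op_Name)
              OR  N.Op_Name  IN (F.Leg_Name, F.Op_Name)
       )
"""


def count_found(conn):
    return conn.execute("SELECT COUNT(*) FROM found").fetchone()[0]


def find_duplicates_setwise(search_name):
    conn = sqlite3.connect(":memory:")
    try:
        for table in ("Name_Table", "found"):
            conn.execute(
                f"CREATE TABLE {table} "
                "(ID INTEGER PRIMARY KEY, Leg_Name TEXT, Op_Name TEXT)"
            )
        conn.executemany("INSERT INTO Name_Table VALUES (?, ?, ?)", ROWS)

        # Level 0: rows that match the search name directly.
        conn.execute(
            "INSERT INTO found SELECT ID, Leg_Name, Op_Name FROM Name_Table "
            "WHERE Leg_Name = ? OR Op_Name = ?",
            (search_name, search_name),
        )

        # Keep expanding until a pass adds no new rows.
        while True:
            before = count_found(conn)
            conn.execute(EXPAND_SQL)
            if count_found(conn) == before:
                break

        return conn.execute(
            "SELECT ID, Leg_Name, Op_Name FROM found ORDER BY ID"
        ).fetchall()
    finally:
        conn.close()


print(find_duplicates_setwise("AA"))
```

Each pass handles a whole level of matches in one statement, so the number of statements grows with the depth of the chain rather than with the number of rows. This sketch compares against every row in `found` on each pass; a status column like the one in the procedure above could limit the comparison to rows added in the previous pass.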

2 answers:

Answer 0 (score: 0)

You can do this with two self-joins. I used DISTINCT to be safe: you don't need it for your sample data, but you may need it for your real data:

SELECT DISTINCT T2.EMPID, T2.LEGAL_NAME, T.LEGAL_NAME
FROM TABLE T
INNER JOIN TABLE T2 ON T.LEGAL_NAME = T2.OPERATING_NAME
INNER JOIN TABLE T3 ON T2.OPERATING_NAME = T3.OPERATING_NAME
WHERE T.LEGAL_NAME <> T3.LEGAL_NAME

Rename and alias the tables and columns as needed.

SQL Fiddle Example

Edit: if you also want the records whose operating name differs from the legal name, UNION them in:

SELECT DISTINCT T2.EMPID, T2.LEGAL_NAME, T.LEGAL_NAME
FROM TABLE T
INNER JOIN TABLE T2 ON T.LEGAL_NAME = T2.OPERATING_NAME
INNER JOIN TABLE T3 ON T2.OPERATING_NAME = T3.OPERATING_NAME
WHERE T.LEGAL_NAME <> T3.LEGAL_NAME

UNION

SELECT EMPID, LEGAL_NAME, OPERATING_NAME
FROM TABLE
WHERE LEGAL_NAME <> OPERATING_NAME

SQL Fiddle Example 2

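Answer 1 (score: 0)

You are essentially trying to build a directed acyclic graph in which the nodes are names, and you want to find all the names that lead to your employer.

There is an introductory tutorial at Oracle Tip: Solving directed graph problems with SQL, part 1, and a related StackOverflow question at Directed graph SQL.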
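
The graph idea can also be sketched outside the database. The example below is a minimal illustration in Python, not the method from the linked articles: it treats each name as a node and each employer row as an undirected edge between its legal and operating names, then walks outward breadth-first from the search name. With the sample data the links can loop back (employers 1 and 2 both touch AA), so a set of visited names is kept. The function name is hypothetical and the rows are the sample data from the question.

```python
from collections import defaultdict, deque

# Sample rows from the question: (ID, Leg_Name, Op_Name)
ROWS = [
    (1, "AA", "AA"),
    (2, "BB", "AA"),
    (3, "CC", "BB"),
    (4, "DD", "DD"),
    (5, "ZZ", "ZZ"),
]


def find_duplicates_bfs(rows, search_name):
    """Return every row reachable from search_name through shared names."""
    # Index: name -> rows that use it as legal or operating name.
    by_name = defaultdict(list)
    for row in rows:
        _, leg_name, op_name = row
        by_name[leg_name].append(row)
        by_name[op_name].append(row)

    seen_names = {search_name}   # names already queued, prevents cycles
    found = {}                   # ID -> row, so each employer appears once
    queue = deque([search_name])

    while queue:
        name = queue.popleft()
        for emp_id, leg_name, op_name in by_name.get(name, []):
            found[emp_id] = (emp_id, leg_name, op_name)
            # Follow the row to its other name, if not visited yet.
            for other in (leg_name, op_name):
                if other not in seen_names:
                    seen_names.add(other)
                    queue.append(other)

    return [found[emp_id] for emp_id in sorted(found)]


print(find_duplicates_bfs(ROWS, "AA"))
```

Because every name is queued at most once, the walk ends after visiting each name and row in the connected group, however many levels deep the chain of matches goes.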