A very complex SQL query problem

Asked: 2011-11-04 19:35:35

Tags: sql-server sql-server-2008 sql-server-2008-r2

I have 2 tables...

  • Customer
  • CustomerIdentification

The Customer table has 2 fields

  • CustomerId varchar(20)
  • Customer_Id_Link varchar(50)

The CustomerIdentification table has 3 fields

  • CustomerId varchar(20)
  • Identification_Number varchar(50)
  • Personal_ID_Type_Code int - a foreign key to another table, but that is not relevant here

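Basically, Customer is the customer master table (with CustomerId as the primary key), and CustomerIdentification can hold multiple identifications for a given customer. In other words, CustomerId in CustomerIdentification is a foreign key to the Customer table. A customer can have many identifications, each with an Identification_Number and a Personal_ID_Type_Code (an integer that tells you whether the identification is a passport, SIN, driver's licence, etc.).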
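For orientation, the two tables described above could be declared roughly as follows. This is only a sketch: the constraint names, the NOT NULL choices and the empty-string default are assumptions that are not stated in the question, and the foreign key from Personal_ID_Type_Code to the type table is left out because that table is not described.

```sql
-- Sketch of the tables as described in the question.
-- Constraint names, nullability and the default are assumptions.
CREATE TABLE dbo.Customer
(
    CustomerId        varchar(20) NOT NULL
        CONSTRAINT PK_Customer PRIMARY KEY,
    Customer_Id_Link  varchar(50) NOT NULL
        CONSTRAINT DF_Customer_Customer_Id_Link DEFAULT ('')
);

CREATE TABLE dbo.CustomerIdentification
(
    CustomerId             varchar(20) NOT NULL
        CONSTRAINT FK_CustomerIdentification_Customer
        REFERENCES dbo.Customer (CustomerId),
    Identification_Number  varchar(50) NOT NULL,
    Personal_ID_Type_Code  int         NOT NULL
);
```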

Right now the Customer table contains the following data. At this point Customer_Id_Link is blank (an empty string):

CustomerId      Customer_Id_Link
--------------------------------
 'CU-1'         <Blank>
 'CU-2'         <Blank>
 'CU-3'         <Blank>
 'CU-4'         <Blank>
 'CU-5'         <Blank>

and the CustomerIdentification table has the following data:

CustomerId    Identification_Number    Personal_ID_Type_Code
------------------------------------------------------------
'CU-1'        'A'                      1
'CU-1'        'A'                      2
'CU-1'        'A'                      3
'CU-2'        'A'                      1
'CU-2'        'B'                      3
'CU-2'        'C'                      4
'CU-3'        'A'                      1
'CU-3'        'B'                      2
'CU-3'        'C'                      4
'CU-4'        'A'                      1
'CU-4'        'B'                      2
'CU-4'        'B'                      3
'CU-5'        'B'                      3

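Basically, multiple customers can have the same Identification_Number and Personal_ID_Type_Code in CustomerIdentification. When this happens, the Customer_Id_Link fields of all those customers need to be updated with a common value (a GUID or anything else). But the processing is more complicated than that.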
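As a starting point, the identifications that are shared by more than one customer can be listed with a GROUP BY. The sketch below is self-contained: it loads the sample rows from the table above into a table variable (the variable name is an assumption made for illustration) and does not touch the real tables.

```sql
-- Sample rows copied from the CustomerIdentification table above.
DECLARE @CustomerIdentification TABLE
(
    CustomerId             varchar(20) NOT NULL,
    Identification_Number  varchar(50) NOT NULL,
    Personal_ID_Type_Code  int         NOT NULL
);

INSERT @CustomerIdentification
VALUES  ('CU-1', 'A', 1), ('CU-1', 'A', 2), ('CU-1', 'A', 3),
        ('CU-2', 'A', 1), ('CU-2', 'B', 3), ('CU-2', 'C', 4),
        ('CU-3', 'A', 1), ('CU-3', 'B', 2), ('CU-3', 'C', 4),
        ('CU-4', 'A', 1), ('CU-4', 'B', 2), ('CU-4', 'B', 3),
        ('CU-5', 'B', 3);

-- One row per (type, number) combination held by 2 or more customers.
SELECT  Personal_ID_Type_Code,
        Identification_Number,
        COUNT(DISTINCT CustomerId) AS CustomerCount
FROM @CustomerIdentification
GROUP BY Personal_ID_Type_Code, Identification_Number
HAVING COUNT(DISTINCT CustomerId) > 1
ORDER BY Personal_ID_Type_Code, Identification_Number;
```

Each row returned is only a candidate match. Whether the customers behind it should actually be linked depends on their other identifications.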

The rules are as follows:

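For each Personal_ID_Type_Code and Identification_Number that matches between customer records:

  • Compare the Identification_Number for every other Personal_ID_Type_Code that the customer records in that match have in common
  • If they all agree, link the customer records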
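The comparison part of the rule can be sketched for pairs of customers. The query below is a pairwise simplification and an assumption about how to read the rule: it lists pairs that share at least one identification and have no identification type in common with different numbers. It does not handle the group-wide check or previously established links, so its output can differ from the intended result of the full rule. The table variable name is hypothetical and the rows are the sample data from the question.

```sql
-- Sample rows copied from the question's CustomerIdentification data.
DECLARE @CustomerIdentification TABLE
(
    CustomerId             varchar(20) NOT NULL,
    Identification_Number  varchar(50) NOT NULL,
    Personal_ID_Type_Code  int         NOT NULL
);

INSERT @CustomerIdentification
VALUES  ('CU-1', 'A', 1), ('CU-1', 'A', 2), ('CU-1', 'A', 3),
        ('CU-2', 'A', 1), ('CU-2', 'B', 3), ('CU-2', 'C', 4),
        ('CU-3', 'A', 1), ('CU-3', 'B', 2), ('CU-3', 'C', 4),
        ('CU-4', 'A', 1), ('CU-4', 'B', 2), ('CU-4', 'B', 3),
        ('CU-5', 'B', 3);

-- Pairs that share an identification (same type and number) ...
SELECT DISTINCT
        a.CustomerId AS CustomerId1,
        b.CustomerId AS CustomerId2
FROM @CustomerIdentification a
INNER JOIN @CustomerIdentification b
    ON  a.Personal_ID_Type_Code = b.Personal_ID_Type_Code
    AND a.Identification_Number = b.Identification_Number
    AND a.CustomerId < b.CustomerId   -- report each pair once
-- ... and have no common type with different numbers.
WHERE NOT EXISTS (
    SELECT 1
    FROM @CustomerIdentification x
    INNER JOIN @CustomerIdentification y
        ON  x.Personal_ID_Type_Code = y.Personal_ID_Type_Code
        AND x.Identification_Number <> y.Identification_Number
    WHERE x.CustomerId = a.CustomerId
      AND y.CustomerId = b.CustomerId
)
ORDER BY CustomerId1, CustomerId2;
```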

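For example:

Match on ID 1 A for CU-1, CU-2, CU-3, CU-4

  • Exception: ID 2 does not match (A on CU-1 vs. B on CU-3)
  • No link made

Match on ID 2 B for CU-3, CU-4

  • No ID mismatches
  • Link CU-3 and CU-4 (update the Customer_Id_Link field with a common value in the Customer table)

Match on ID 3 A for CU-1, CU-4

  • Exception: ID 2 does not match (A vs. B)
  • No link made

Match on ID 3 B for CU-2, CU-5

  • No ID mismatches
  • Link CU-2 and CU-5 (update the Customer_Id_Link field with a common value in the Customer table for both)

Match on ID 4 C for CU-2, CU-3

  • CU-2 is already linked, so keep CU-5 in the list of linked customers
  • CU-3 is already linked, so keep CU-4 in the list of linked customers
  • Exception: ID 3 does not match (B on CU-2 vs. A on CU-4)
  • No link made (the previous links remain in place)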

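Any help would be greatly appreciated. This has kept me awake for two days and I can't seem to find a solution. Ideally, the solution would be a stored procedure that I can execute to perform the customer linking.

- SQL Server 2008 R2 Standard Edition, 64-bit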

UPDATE -------------------------------

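I know this problem is hard to explain, so I'll take the blame for that. What I really want is to link all customers that have the same identification number, except that a customer can have more than 1 identification number. Take example 1: 1 A (1 is the Personal_ID_Type_Code, A is the identification number) exists for 4 different customers: CU-1, CU-2, CU-3, CU-4. So they could be the same customer, existing 4 different times in the Customer table with different customer IDs, and we would need to link them with 1 common value. However, CU-1 has 2 other identifications, and if even 1 of them differs from the other 3 customers (CU-2, CU-3, CU-4), they are not the same customer. ID 2 with number A does not match ID 2 on CU-3 (which is B), and the same goes for CU-4. Also, even though ID 2 with number A does not exist on CU-2, CU-1's ID 3 with number A does not match CU-2's ID 3 (which is B). So there is no match at all.

The next common ID and number is 2 B, which exists on CU-3 and CU-4. These two customers are in fact the same, because they both have ID 1 A and ID 2 B. ID 4 C and ID 3 A are irrelevant, because those two ID types are different. This effectively means that this customer has 4 IDs: 1 A, 2 B, 4 C and 3 A. So now we need to link this customer with a common unique value (a GUID) in the Customer table.

I hope I have now explained this very complicated problem. It is hard to explain because it is a very unusual problem.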

1 answer:

Answer 0: (score: 2)

I've changed your data model to try to make it more obvious what's going on...

CREATE TABLE [dbo].[Customer]
(
    [CustomerName]      VARCHAR(20)     NOT NULL,
    [CustomerLink]      VARBINARY(20)   NULL
)

CREATE TABLE [dbo].[CustomerIdentification]
(
    [CustomerName]      VARCHAR(20)     NOT NULL,
    [ID]                VARCHAR(50)     NOT NULL,
    [IDType]            VARCHAR(16)     NOT NULL
)

I've added some test data..

INSERT  [dbo].[Customer]
        ([CustomerName])
VALUES  ('Fred'),
        ('Bob'),
        ('Vince'),
        ('Tom'),
        ('Alice'),
        ('Matt'),
        ('Dan')

INSERT  [dbo].[CustomerIdentification]
VALUES  
        ('Fred',    'A',    'Passport'),
        ('Fred',    'A',    'SIN'),
        ('Fred',    'A',    'Drivers Licence'),
        ('Bob',     'A',    'Passport'),
        ('Bob',     'B',    'Drivers Licence'),
        ('Bob',     'C',    'Credit Card'),
        ('Vince',   'A',    'Passport'),
        ('Vince',   'B',    'SIN'),
        ('Vince',   'C',    'Credit Card'),
        ('Tom',     'A',    'Passport'),
        ('Tom',     'B',    'SIN'),
        ('Tom',     'B',    'Drivers Licence'),
        ('Alice',   'B',    'Drivers Licence'),
        ('Matt',    'X',    'Drivers Licence'),
        ('Dan',     'X',    'Drivers Licence')

This is what you're looking for:

;WITH [cteNonMatchingIDs] AS (
    -- Pairs where the IDType is the same, but 
    -- name and ID don't match
    SELECT  ci3.[CustomerName] AS [CustomerName1],
            ci4.[CustomerName] AS [CustomerName2]
    FROM [dbo].[CustomerIdentification] ci3
    INNER JOIN [dbo].[CustomerIdentification] ci4
        ON ci3.[IDType] = ci4.[IDType]
    WHERE ci3.[CustomerName] <> ci4.[CustomerName]
    AND ci3.[ID] <> ci4.[ID]
),
[cteMatchedPairs] AS (
    -- Pairs where the IDType and ID match, and
    -- there aren't any non matching IDs for the
    -- CustomerName
    SELECT DISTINCT 
            ci1.[CustomerName] AS [CustomerName1],
            ci2.[CustomerName] AS [CustomerName2]
    FROM [dbo].[CustomerIdentification] ci1
    LEFT JOIN [dbo].[CustomerIdentification] ci2
        ON ci1.[CustomerName] <> ci2.[CustomerName]
        AND ci1.[IDType] = ci2.[IDType] 
    WHERE ci1.[ID] = ISNULL(ci2.[ID], ci1.[ID])
    AND NOT EXISTS (
        SELECT 1
        FROM [cteNonMatchingIDs]
        WHERE ci1.[CustomerName] = [CustomerName1] -- correlated subquery
        AND ci2.[CustomerName] = [CustomerName2]
    )
    AND ci1.[CustomerName] < ci2.[CustomerName]
),
[cteMatchedList] ([CustomerName], [CustomerNameList]) AS (
    -- Turn the matched pairs into list of matching
    -- CustomerNames
    SELECT  [CustomerName1],
            [CustomerNameList]
    FROM (
        SELECT  [CustomerName1],
                CONVERT(VARCHAR(1000), '$'
                 + [CustomerName1] + '$'
                 + [CustomerName2]) AS [CustomerNameList]
        FROM [cteMatchedPairs]
        UNION ALL
        SELECT  [CustomerName2],
                CONVERT(VARCHAR(1000), '$'
                 + [CustomerName2]) AS [CustomerNameList]
        FROM [cteMatchedPairs]
    ) [cteMatchedPairs]
    UNION ALL
    SELECT  [cteMatchedList].[CustomerName],
            CONVERT(VARCHAR(1000),[CustomerNameList] + '$'
             + [cteMatchedPairs].[CustomerName2])
    FROM [cteMatchedList] -- recursive CTE
    INNER JOIN [cteMatchedPairs]
        ON RIGHT([cteMatchedList].[CustomerNameList],
         LEN([cteMatchedPairs].[CustomerName1])
        ) = [cteMatchedPairs].[CustomerName1]
),
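[cteSubstringLists] AS (
    -- For each list, find every list that contains it as a
    -- substring, so shorter lists map to the longer ones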
    SELECT  r1.[CustomerName],
            r2.[CustomerNameList]
    FROM [cteMatchedList] r1
    INNER JOIN [cteMatchedList] r2
        ON r2.[CustomerNameList] LIKE '%' + r1.[CustomerNameList] + '%'
),
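[cteCustomerLink] AS (
    -- For each CustomerName keep the longest containing list
    -- and hash it to get a link value shared by the group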
    SELECT DISTINCT 
            x1.[CustomerName],
            HASHBYTES('SHA1', x2.[CustomerNameList]) AS [CustomerLink]
    FROM (
        SELECT  [CustomerName],
                MAX(LEN([CustomerNameList])) AS [MAX LEN CustomerList]
        FROM [cteSubstringLists]
        GROUP BY [CustomerName]
    ) x1
    INNER JOIN (
        SELECT  [CustomerName],
                LEN([CustomerNameList]) AS [LEN CustomerList], 
                [CustomerNameList]
        FROM [cteSubstringLists]
    ) x2
        ON x1.[MAX LEN CustomerList] = x2.[LEN CustomerList]
        AND x1.[CustomerName] = x2.[CustomerName]
)
UPDATE  c
SET     [CustomerLink] = cl.[CustomerLink]
FROM [dbo].[Customer] c
INNER JOIN [cteCustomerLink] cl
    ON cl.[CustomerName] = c.[CustomerName]


SELECT *
FROM [dbo].[Customer]
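
To check the result, the linked customers can be summarised per link value. This is a small sketch that assumes the [dbo].[Customer] table from this answer exists and that the UPDATE above has already been run. The column aliases are made up for illustration.

```sql
-- One row per link value, with the number of customers sharing it.
-- Customers that were never linked keep a NULL CustomerLink
-- and are excluded here.
SELECT  c.[CustomerLink],
        COUNT(*)              AS [CustomerCount],
        MIN(c.[CustomerName]) AS [FirstCustomerName]
FROM [dbo].[Customer] c
WHERE c.[CustomerLink] IS NOT NULL
GROUP BY c.[CustomerLink]
ORDER BY [FirstCustomerName];
```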