How to synthesize an attribute for joined tables

Time: 2016-11-10 13:59:59

Tags: sql sql-server tsql

My view is defined as follows:

CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts
FROM  [dbo].AllContracts AS C
    INNER JOIN [dbo].AllContracts AS CC
        ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
            OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
            OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
            OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
            OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
    (
        SELECT UniqueID FROM [dbo].DefinitiveMatches
    )
    AND C.AssociatedUser IS NULL
    AND C.UniqueID <> CC.UniqueID

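Basically, it finds contracts where, for example, the first name and the birthday match. This works well. Now I want to add a synthetic attribute to each row, built from the values of only one source row.

Let me give an example to make this clearer. Suppose I have the following table: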

UniqueID  | FirstName | LastName  | Birthday

1         | Peter     | Smith     | 1980-11-04
2         | Peter     | Gray      | 1980-11-04
3         | Peter     | Gray-Smith| 1980-11-04
4         | Frank     | May       | 1985-06-09
5         | Frank-Paul| May       | 1985-06-09
6         | Gina      | Ericson   | 1950-11-04

The resulting view should look like this:

UniqueID | PossiblyMatchingContracts | SyntheticID

1        | 2                         | PeterSmith1980-11-04
1        | 3                         | PeterSmith1980-11-04
2        | 1                         | PeterSmith1980-11-04
2        | 3                         | PeterSmith1980-11-04
3        | 1                         | PeterSmith1980-11-04
3        | 2                         | PeterSmith1980-11-04
4        | 5                         | FrankMay1985-06-09
5        | 4                         | FrankMay1985-06-09
6        | NULL                      | NULL [or] GinaEricson1950-11-04

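Note that the SyntheticID column uses values from only ONE of the matching source rows. It doesn't matter which one. I export this view to another application and need to be able to identify each "match group" afterwards.

Is it clear what I mean? Any ideas how this could be done in SQL?

Perhaps it helps to elaborate on the actual use case: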

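I am importing contracts from different systems. To allow for typos, or for people who have married but whose last name was updated in only one system, I need to find so-called "possible matches". Two or more contracts are considered possible matches if they contain the same birthday plus the same first name, last name, or birth name. This implies that if contract A matches contract B, then contract B also matches contract A.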
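The matching rule described above can be sketched outside of SQL as well. The following is a minimal Python illustration, not the view itself: the `BirthName` field is an assumption (the sample table does not show it), and the cross check between last name and birth name mirrors the `LB`/`BB` comparisons in the view's join condition.

```python
from itertools import combinations

# Hypothetical records: (UniqueID, FirstName, LastName, BirthName, Birthday).
# BirthName is an assumed field; the question's sample data does not include it.
contracts = [
    (1, "Peter", "Smith", None, "1980-11-04"),
    (2, "Peter", "Gray", None, "1980-11-04"),
    (3, "Peter", "Gray-Smith", None, "1980-11-04"),
    (4, "Frank", "May", None, "1985-06-09"),
    (5, "Frank-Paul", "May", None, "1985-06-09"),
    (6, "Gina", "Ericson", None, "1950-11-04"),
]

def possibly_match(a, b):
    """Same birthday plus the same first, last, or birth name."""
    _, first_a, last_a, birth_a, day_a = a
    _, first_b, last_b, birth_b, day_b = b
    if day_a != day_b:
        return False
    # A last name may also equal the other contract's birth name, and vice versa.
    family_a = {n for n in (last_a, birth_a) if n is not None}
    family_b = {n for n in (last_b, birth_b) if n is not None}
    return first_a == first_b or bool(family_a & family_b)

# The relation is symmetric, so every match is emitted in both directions.
pairs = []
for a, b in combinations(contracts, 2):
    if possibly_match(a, b):
        pairs.append((a[0], b[0]))
        pairs.append((b[0], a[0]))
pairs.sort()
```

With the sample data this yields the same (UniqueID, PossiblyMatchingContracts) pairs as the expected view, except for the unmatched contract 6.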

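The target system uses multivalue reference attributes to store these relationships. The ultimate goal is to create user objects for these contracts. The first catch is that there should be only one user object for multiple matching contracts. That is why I create these matches in the view. The second catch is that the user objects are created by workflows, which run in parallel, one per contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check whether a matching user object already exists, or whether another workflow is about to create it. Because the workflow engine is very slow compared to SQL, the workflows should not repeat the whole matching test. So the idea is to let each workflow check only the "synthetic ID".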
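To make the intended workflow check concrete, here is a rough Python sketch. Everything in it is hypothetical: `SyntheticIdRegistry` is an in-memory stand-in for whatever shared store the real workflow engine would consult, and the action names are invented for illustration. It only shows the "first workflow claims the ID, the others reuse it" idea.

```python
import threading

class SyntheticIdRegistry:
    """Hypothetical stand-in for the shared store the workflows would check."""

    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def try_claim(self, synthetic_id):
        """Return True only for the first caller that claims this ID."""
        with self._lock:
            if synthetic_id in self._claimed:
                return False
            self._claimed.add(synthetic_id)
            return True

registry = SyntheticIdRegistry()
outcomes = []

def workflow(contract_id, synthetic_id):
    # Only the workflow that wins the claim creates the user object.
    action = "create_user" if registry.try_claim(synthetic_id) else "reuse_user"
    outcomes.append((contract_id, action))

# Three matching contracts share one synthetic ID and run in parallel.
threads = [
    threading.Thread(target=workflow, args=(cid, "PeterSmith1980-11-04"))
    for cid in (1, 2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

creators = [cid for cid, action in outcomes if action == "create_user"]
```

Whichever thread runs first ends up in `creators`; the other two see the ID as already claimed.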

7 answers:

Answer 0: (score: 3)

I solved this with a multi-step approach:

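  1. Create a list of possible first-level matches for each contract
  2. Create the base group list, assigning a different group to each contract (as if it were unrelated to any other)
  3. Iterate over the match list, updating the group list whenever more contracts need to be added to a group
  4. Recursively build the SyntheticID from the final group list
  5. Output the result

    First, let me explain what I have understood, so that you can judge whether my approach is correct.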

    1) Matches propagate in "cascade"

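    What I mean is: if "Peter Smith" is grouped with "Peter Gray", then all Smiths and all Grays are related (provided they have the same birth date), so Luke Smith can end up in the same group as John Gray.

    2) I don't understand what you mean by "birth name"

    You say contracts match on "first name, last name, or birth name". Sorry, I'm Italian: I thought birth name and first name were the same thing, and there is no such column in your data either. Maybe it has to do with the dash between names? When FirstName is Frank-Paul, does that mean it should match both Frank and Paul? When LastName is Gray-Smith, does that mean it should match both Gray and Smith?

    In the code below I simply ignored this issue, but it can be handled if needed (I tried it: split the names, unpivot them, and treat them as double matches).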
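The "cascade" reading in point 1 amounts to finding connected groups in the match graph. As a minimal illustration of that idea only, here is a union-find sketch in Python; it is a swapped-in technique, not the T-SQL loop used below. Contracts 7 (Luke Smith) and 8 (John Gray) are hypothetical additions that share the Peters' birthday.

```python
def find(parent, x):
    """Follow parent links to the representative of x's group."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the chains short
        x = parent[x]
    return x

def group_contracts(ids, pairs):
    """Map each contract ID to the smallest ID in its connected group."""
    parent = {i: i for i in ids}
    for a, b in pairs:
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            # Keep the smaller ID as the representative of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return {i: find(parent, i) for i in ids}

# 1 = Peter Smith, 2 = Peter Gray, 7 = Luke Smith, 8 = John Gray (7 and 8 are hypothetical).
ids = [1, 2, 6, 7, 8]
direct_matches = [
    (1, 2),  # same first name
    (1, 7),  # same last name (Smith)
    (2, 8),  # same last name (Gray)
]
groups = group_contracts(ids, direct_matches)
```

Here 7 and 8 never match each other directly, yet both land in group 1 because the matches chain through contracts 1 and 2.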

    Step zero: some declarations and preparation of the base data

    declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
    declare @comb as table (id1 int, id2 int, done bit)
    declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
    declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
    declare @id1 as int, @g int
    
    ;with
    t as (
        select *
        from (values
        (1         , 'Peter'     , 'Smith'     , '1980-11-04'),
        (2         , 'Peter'     , 'Gray'      , '1980-11-04'),
        (3         , 'Peter'     , 'Gray-Smith', '1980-11-04'),
        (4         , 'Frank'     , 'May'       , '1985-06-09'),
        (5         , 'Frank-Paul', 'May'       , '1985-06-09'),
        (6         , 'Gina'      , 'Ericson'   , '1950-11-04')
        ) x (UniqueID  , FirstName , LastName  , Birthday)
    )
    insert into @cli
    select * from t
    

    Step one: create a list of possible first-level matches for each contract

    ;with
    p as(select UniqueID, Birthday, FirstName, LastName from @cli),
    m as (
        select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
        from p
        join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
        where p.UniqueID<=pp.UniqueID
    )
    insert into @comb
    select UniqueID1,UniqueID2,0
    from m
    

    Step two: create the base group list

    insert into @grp
    select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2
    

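    Step three: iterate over the match list and update the group list, looping only over the contracts that have matches and still need to be processed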

    set @id1 = 0
    while not(@id1 is null) begin
        set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)
    
        if not(@id1 is null) begin
    
            set @g = (select grp from @grp where id=@id1)
            update g set grp= @g
            from @grp g
            inner join @comb c on g.id = c.id2
            where c.id2<>@id1 and c.id1=@id1
            and grp<>@g
    
            update @comb set done=1 where id1=@id1
        end
    end
    

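    Step four: build the SyntheticID by recursively adding all (distinct) first names and last names of the group to it. I used '_' as the separator between the birth date, the first names, and the last names, and ',' as the separator within a list of names, to avoid collisions.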

    ;with
    c as(
        select c.*, g.grp
        from @cli c
        join @grp g on g.id = c.UniqueID
    ),
    d as (
        select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
        from (
            select distinct c.grp g, 1 t, FirstName s from c
            union 
            select distinct c.grp, 2, LastName from c 
            ) l
    ),
    r as (
        select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
        from d
        join @cli t on t.UniqueID=d.g
        where n1=1
        union all
        select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') +  d.s as varchar(1000)), r.n1, r.n2
        from d
        join r on r.g = d.g and r.n1=d.n1-1 
    )
    insert into @str_id 
    select g, Names
    from r
    where n2=1
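
For readers who want to see the string format in isolation, here is a small Python sketch of the same naming scheme (not the recursive CTE above). It assumes that names never contain '_' or ',' and that all rows of a group share one birthday. The separators matter: without them, "Anna" + "BellSmith" and "AnnaBell" + "Smith" would produce the same ID.

```python
def build_synthetic_id(rows):
    """rows: (FirstName, LastName, Birthday) tuples that belong to one group."""
    # Assumption: every row in a group has the same birthday.
    birthday = rows[0][2]
    first_names = ",".join(sorted({first for first, _, _ in rows}))
    last_names = ",".join(sorted({last for _, last, _ in rows}))
    # '_' separates the three parts, ',' separates names within one part.
    return f"{birthday}_{first_names}_{last_names}"

peter_group = [
    ("Peter", "Smith", "1980-11-04"),
    ("Peter", "Gray", "1980-11-04"),
    ("Peter", "Gray-Smith", "1980-11-04"),
]
frank_group = [
    ("Frank", "May", "1985-06-09"),
    ("Frank-Paul", "May", "1985-06-09"),
]
peter_id = build_synthetic_id(peter_group)
frank_id = build_synthetic_id(frank_group)
```

Because the names are de-duplicated and sorted, the ID does not depend on the order in which the rows of a group are read.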
    

    Step five: output the result

    select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
    from @cli c
    left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
    left join @grp g on c.UniqueID = g.id
    left join @str_id s on s.grp = g.grp
    

    Here is the result

    UniqueID    PossibleMatchingContract    SyntheticID
    1           2                           1980-11-04_Peter_Gray,Gray-Smith,Smith
    1           3                           1980-11-04_Peter_Gray,Gray-Smith,Smith
    2           1                           1980-11-04_Peter_Gray,Gray-Smith,Smith
    2           3                           1980-11-04_Peter_Gray,Gray-Smith,Smith
    3           1                           1980-11-04_Peter_Gray,Gray-Smith,Smith
    3           2                           1980-11-04_Peter_Gray,Gray-Smith,Smith
    4           5                           1985-06-09_Frank,Frank-Paul_May
    5           4                           1985-06-09_Frank,Frank-Paul_May
    6           NULL                        1950-11-04_Gina_Ericson
    

    I think a SyntheticID generated this way should also be "unique" for each group.

Answer 1: (score: 1)

This creates a synthetic value and is easy to change to suit your needs.

DECLARE @T TABLE (
    UniqueID INT
    ,FirstName VARCHAR(200)
    ,LastName  VARCHAR(200)
    ,Birthday DATE
)

INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 1,'Peter','Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 2,'Peter','Gray','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 3,'Peter','Gray-Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 4,'Frank','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 5,'Frank-Paul','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 6,'Gina','Ericson','1950-11-04'

DECLARE @PossibleMatches TABLE (UniqueID INT,[PossibleMatch] INT,SynKey VARCHAR(2000)
)

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
        AND t1.FirstName=t2.FirstName
        AND t1.LastName=t2.LastName
        AND t1.UniqueID<>t2.UniqueID

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
        AND t1.FirstName=t2.FirstName
        AND t1.UniqueID<>t2.UniqueID

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID,t2.UniqueID,'Ln=' + t1.LastName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
        AND t1.LastName=t2.LastName
        AND t1.UniqueID<>t2.UniqueID

INSERT INTO @PossibleMatches
    SELECT t1.UniqueID,pm.UniqueID,'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
    FROM @T t1
    LEFT JOIN @PossibleMatches pm on pm.UniqueID=t1.UniqueID
    WHERE pm.UniqueID IS NULL

SELECT *
FROM @PossibleMatches
ORDER BY UniqueID,[PossibleMatch]

Answer 2: (score: 1)

I think this will work for you:

SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts,
    FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday) 
          OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM 
    [dbo].AllContracts AS C INNER JOIN
    [dbo].AllContracts AS CC ON
        C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE 
    C.UniqueID NOT IN(
    SELECT UniqueID FROM [dbo].DefinitiveMatches)
AND C.AssociatedUser IS NULL

Answer 3: (score: 0)

You can try this:

SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts,
    FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday) 
          OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM 
    [dbo].AllContracts AS C
INNER JOIN
    [dbo].AllContracts AS CC
ON
        C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
    OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
    OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
    OR
        C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
    OR
        C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE 
    C.UniqueID NOT IN
    (
        SELECT UniqueID FROM [dbo].DefinitiveMatches
    )
AND
    C.AssociatedUser IS NULL

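This produces an extra row per contract (because we left out C.UniqueID <> CC.UniqueID, each contract also matches itself), but it should give you a good solution.

Answer 4: (score: 0)

Below is an example with some sample data taken from the original post. The idea: generate all SyntheticIDs in a CTE, query all records that have a "PossibleMatch", and union them with all records not yet included: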

DECLARE @t TABLE(
  UniqueID int
 ,FirstName nvarchar(20)
 ,LastName nvarchar(20)
 ,Birthday datetime
)

INSERT INTO @t VALUES (1, 'Peter', 'Smith', '1980-11-04');
INSERT INTO @t VALUES (2, 'Peter', 'Gray', '1980-11-04');
INSERT INTO @t VALUES (3, 'Peter', 'Gray-Smith', '1980-11-04');
INSERT INTO @t VALUES (4, 'Frank', 'May', '1985-06-09');
INSERT INTO @t VALUES (5, 'Frank-Paul', 'May', '1985-06-09');
INSERT INTO @t VALUES (6, 'Gina', 'Ericson', '1950-11-04');


WITH ctePrep AS(
SELECT UniqueID, FirstName, LastName, BirthDay,
       ROW_NUMBER() OVER (PARTITION BY FirstName, BirthDay ORDER BY FirstName, BirthDay) AS k,
       FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
  FROM @t
),
cteKeys AS(
SELECT FirstName, BirthDay, SyntheticID
  FROM ctePrep
  WHERE k = 1
),
cteFiltered AS(
SELECT 
    C.UniqueID,
    CC.UniqueID AS PossiblyMatchingContracts,
    keys.SyntheticID
FROM @t AS C
JOIN @t AS CC ON C.FirstName = CC.FirstName
              AND C.Birthday = CC.Birthday
JOIN cteKeys AS keys ON keys.FirstName = c.FirstName
                  AND keys.Birthday = c.Birthday
WHERE C.UniqueID <> CC.UniqueID
)
SELECT UniqueID, PossiblyMatchingContracts, SyntheticID
  FROM cteFiltered
UNION ALL
SELECT UniqueID, NULL, FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
  FROM @t
  WHERE UniqueID NOT IN (SELECT UniqueID FROM cteFiltered)

Hope this helps. The result looks good to me:

UniqueID    PossiblyMatchingContracts   SyntheticID
---------------------------------------------------------------
2           1                           PeterSmith1980-11-04
3           1                           PeterSmith1980-11-04
1           2                           PeterSmith1980-11-04
3           2                           PeterSmith1980-11-04
1           3                           PeterSmith1980-11-04
2           3                           PeterSmith1980-11-04
4           NULL                        FrankMay1985-06-09
5           NULL                        Frank-PaulMay1985-06-09
6           NULL                        GinaEricson1950-11-04

Answer 5: (score: 0)

Tested in SSMS, and it works perfectly. :)

--create table structure
create table #temp
(
    uniqueID int,
    firstname varchar(15),
    lastname varchar(15),
    birthday date
)

--insert data into the table
insert #temp
select 1, 'peter','smith','1980-11-04'
union all
select 2, 'peter','gray','1980-11-04'
union all
select 3, 'peter','gray-smith','1980-11-04'
union all
select 4, 'frank','may','1985-06-09'
union all
select 5, 'frank-paul','may','1985-06-09'
union all
select 6, 'gina','ericson','1950-11-04'

select * from #temp

--solution is as below

select ab.uniqueID
, PossiblyMatchingContracts
, c.firstname+c.lastname+cast(c.birthday as varchar) as synID
from
(
    select a.uniqueID
            , case 
                when  a.uniqueID < min(b.uniqueID)over(partition by a.uniqueid)
                    then a.uniqueID
                else min(b.uniqueID)over(partition by a.uniqueid)
            end as SmallestID
            , b.uniqueID as PossiblyMatchingContracts
        from #temp a
        left join #temp b
        on (a.firstname = b.firstname OR a.lastname = b.lastname) AND a.birthday = b.birthday AND a.uniqueid <> b.uniqueID
) as ab
left join #temp c
on ab.SmallestID = c.uniqueID

The result was posted as a screenshot (image not preserved).

Answer 6: (score: 0)

Suppose we have the following table (a VIEW in your case):

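UniqueID | PossiblyMatchingContracts | SyntheticID

1        | 2                         | G1
1        | 3                         | G2
2        | 1                         | G3
2        | 3                         | G4
3        | 1                         | G4
3        | 4                         | G6
4        | 5                         | G7
5        | 4                         | G8
6        | NULL                      | G9

In your case you can set the initial SyntheticID of each row to a string such as PeterSmith1980-11-04, built from the row's UniqueID. The approach is a recursive CTE query that divides all rows into unconnected groups and selects MAX(SyntheticID) within the current group as the new SyntheticID for all rows in that group.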
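As a rough illustration of the idea described in this answer (and not the recursive CTE itself), the following Python sketch repeatedly pushes the largest label across each match pair until nothing changes. It is a simplification under stated assumptions: labels are attached to contracts rather than to rows of the view, and the initial labels are the FirstName+LastName+Birthday strings of the question's sample data.

```python
def propagate_max_label(pairs, initial):
    """Give every contract the largest label found in its connected group."""
    label = dict(initial)
    changed = True
    while changed:
        changed = False
        for a, b in pairs:
            if b is None:  # contract without any possible match
                continue
            best = max(label[a], label[b])
            if label[a] != best or label[b] != best:
                label[a] = label[b] = best
                changed = True
    return label

# (UniqueID, PossiblyMatchingContracts) rows as in the question's expected view.
pairs = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (4, 5), (5, 4), (6, None)]
initial = {
    1: "PeterSmith1980-11-04",
    2: "PeterGray1980-11-04",
    3: "PeterGray-Smith1980-11-04",
    4: "FrankMay1985-06-09",
    5: "Frank-PaulMay1985-06-09",
    6: "GinaEricson1950-11-04",
}
final = propagate_max_label(pairs, initial)
```

Taking the maximum is arbitrary; any deterministic choice (for example the minimum) would identify the groups equally well, since the question states that it does not matter which source row supplies the value.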