My view is defined as follows:
CREATE VIEW [dbo].[PossiblyMatchingContracts] AS
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts
FROM [dbo].AllContracts AS C
INNER JOIN [dbo].AllContracts AS CC
ON C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND C.AssociatedUser IS NULL
AND C.UniqueID <> CC.UniqueID
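It basically finds contracts where, for example, the first name and the birthday match. This works great. Now I want to add a synthetic attribute to each row, with a value taken from only one of the source rows.
Let me give an example to make this clearer. Suppose I have the following table: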
UniqueID | FirstName | LastName | Birthday
1 | Peter | Smith | 1980-11-04
2 | Peter | Gray | 1980-11-04
3 | Peter | Gray-Smith| 1980-11-04
4 | Frank | May | 1985-06-09
5 | Frank-Paul| May | 1985-06-09
6 | Gina | Ericson | 1950-11-04
The resulting view should look like this:
UniqueID | PossiblyMatchingContracts | SyntheticID
1 | 2 | PeterSmith1980-11-04
1 | 3 | PeterSmith1980-11-04
2 | 1 | PeterSmith1980-11-04
2 | 3 | PeterSmith1980-11-04
3 | 1 | PeterSmith1980-11-04
3 | 2 | PeterSmith1980-11-04
4 | 5 | FrankMay1985-06-09
5 | 4 | FrankMay1985-06-09
6 | NULL | NULL [or] GinaEricson1950-11-04
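Note that the SyntheticID column uses values from only ONE of the matching source rows. It doesn't matter which one. I export this view to another application and need to be able to identify each "match group" afterwards.
Is it clear what I mean? Any ideas how this could be done in SQL?
Maybe it helps to elaborate on the actual use case:
I am importing contracts from different systems. To account for possible typos, or for people who got married but whose last name was updated in only one system, I need to find so-called "potential matches". Two or more contracts are considered a potential match if they contain the same birthday plus the same first, last, or birth name. This implies that if contract A matches contract B, then contract B also matches contract A.
The target system uses multi-valued reference attributes to store these relationships. The final goal is to create user objects for these contracts. The first catch is that multiple matching contracts should get only one user object, which is why I create these matches in the view. The second catch is that user objects are created by workflows, which run in parallel for each contract. To avoid creating multiple user objects for matching contracts, each workflow needs to check whether a matching user object already exists, or whether another workflow is about to create it. Because the workflow engine is extremely slow compared to SQL, the workflows should not repeat the whole matching test. So the idea is to let the workflows check only the "synthetic ID".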
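Answer 0 (score: 3)
I solved this with a multi-step approach:
First, let me explain what I understood, so you can judge whether my approach is correct.
1) Matches propagate in a "cascade"
I mean that if "Peter Smith" is grouped with "Peter Gray", then all Smiths and all Grays are related (provided they share the same birth date), so Luke Smith can end up in the same group as John Gray.
2) I don't understand what you mean by "birth name"
You said contracts match on "first, last, or birth name". Sorry, I'm Italian; I thought birth name and first name were the same thing, and there is no such column in your data either. Maybe it has to do with the hyphen between names? When FirstName is Frank-Paul, does it mean it should match both Frank and Paul? When LastName is Gray-Smith, does it mean it should match both Gray and Smith?
In the code below I simply ignored this issue, but it can be handled if needed (I tried it: splitting the names, unpivoting them, and treating them as double matches).
Step zero: some declarations and preparing the base data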
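The name splitting mentioned above is not shown in the answer. A minimal sketch of one way it could work, assuming SQL Server 2016 or later (for STRING_SPLIT) and a hypothetical @names table that is not part of the original code:

```sql
-- Hypothetical sample data; @names stands in for the real contract table.
DECLARE @names TABLE (UniqueID int PRIMARY KEY, LastName varchar(20));
INSERT INTO @names VALUES (1, 'Smith'), (2, 'Gray'), (3, 'Gray-Smith');

-- Break each last name into its hyphen-separated parts,
-- so 'Gray-Smith' contributes both 'Gray' and 'Smith'.
-- STRING_SPLIT requires compatibility level 130 or higher.
WITH parts AS (
    SELECT n.UniqueID, p.value AS NamePart
    FROM @names AS n
    CROSS APPLY STRING_SPLIT(n.LastName, '-') AS p
)
-- Two contracts are a possible match when they share any name part.
SELECT DISTINCT a.UniqueID, b.UniqueID AS PossibleMatch
FROM parts AS a
JOIN parts AS b
    ON a.NamePart = b.NamePart
   AND a.UniqueID <> b.UniqueID
ORDER BY a.UniqueID, PossibleMatch;
```

With this sample, contract 3 matches both 1 and 2, while 1 and 2 do not match each other directly. A real version would also have to compare the birthday and apply the same split to the first name.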
declare @cli as table (UniqueID int primary key, FirstName varchar(20), LastName varchar(20), Birthday varchar(20))
declare @comb as table (id1 int, id2 int, done bit)
declare @grp as table (ix int identity primary key, grp int, id int, unique (grp,ix))
declare @str_id as table (grp int primary key, SyntheticID varchar(1000))
declare @id1 as int, @g int
;with
t as (
select *
from (values
(1 , 'Peter' , 'Smith' , '1980-11-04'),
(2 , 'Peter' , 'Gray' , '1980-11-04'),
(3 , 'Peter' , 'Gray-Smith', '1980-11-04'),
(4 , 'Frank' , 'May' , '1985-06-09'),
(5 , 'Frank-Paul', 'May' , '1985-06-09'),
(6 , 'Gina' , 'Ericson' , '1950-11-04')
) x (UniqueID , FirstName , LastName , Birthday)
)
insert into @cli
select * from t
Step one: create the list of possible first-level matches for each contract
;with
p as(select UniqueID, Birthday, FirstName, LastName from @cli),
m as (
select p.UniqueID UniqueID1, p.FirstName FirstName1, p.LastName LastName1, p.Birthday Birthday1, pp.UniqueID UniqueID2, pp.FirstName FirstName2, pp.LastName LastName2, pp.Birthday Birthday2
from p
join p pp on (pp.Birthday=p.Birthday) and (pp.FirstName = p.FirstName or pp.LastName = p.LastName)
where p.UniqueID<=pp.UniqueID
)
insert into @comb
select UniqueID1,UniqueID2,0
from m
Step two: create the base group list
insert into @grp
select ROW_NUMBER() over(order by id1), id1 from @comb where id1=id2
Step three: iteratively update the group list from the match list
Loop only over contracts that need matching and updating.
set @id1 = 0
while not(@id1 is null) begin
set @id1 = (select top 1 id1 from @comb where id1<>id2 and done=0)
if not(@id1 is null) begin
set @g = (select grp from @grp where id=@id1)
update g set grp= @g
from @grp g
inner join @comb c on g.id = c.id2
where c.id2<>@id1 and c.id1=@id1
and grp<>@g
update @comb set done=1 where id1=@id1
end
end
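Not part of the answer above: a set-based sketch of the same group-merging idea, assuming a hypothetical @edge table of match pairs and a hypothetical @label table in which every contract starts in its own group. Each pass pulls a contract down to the smallest group number among its direct matches, and the loop stops once a pass changes nothing.

```sql
-- Hypothetical match pairs (each pair stored once) and starting labels.
DECLARE @edge TABLE (id1 int, id2 int);
INSERT INTO @edge VALUES (1, 2), (1, 3), (2, 3), (4, 5);

DECLARE @label TABLE (id int PRIMARY KEY, grp int);
INSERT INTO @label VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6);

DECLARE @changed int = 1;
WHILE @changed > 0
BEGIN
    -- For every contract, find the smallest group among its neighbours
    -- (pairs are read in both directions) and adopt it if it is smaller.
    UPDATE l
    SET grp = m.minGrp
    FROM @label AS l
    JOIN (
        SELECT x.id, MIN(n.grp) AS minGrp
        FROM (SELECT id1 AS id, id2 AS other FROM @edge
              UNION ALL
              SELECT id2, id1 FROM @edge) AS x
        JOIN @label AS n ON n.id = x.other
        GROUP BY x.id
    ) AS m ON m.id = l.id
    WHERE m.minGrp < l.grp;

    -- Number of rows relabelled in this pass; zero ends the loop.
    SET @changed = @@ROWCOUNT;
END;

SELECT id, grp FROM @label ORDER BY id;
```

On this sample, contracts 1 to 3 should end up in group 1, contracts 4 and 5 in group 4, and contract 6 stays alone in group 6. Because labels only ever decrease, the loop terminates, and chains of indirect matches are merged after a few passes.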
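Step four: build the SyntheticID
Recursively add all (distinct) first and last names of the group to the SyntheticID. I used '_' as the separator between the birth date, the first names, and the last names, and ',' as the separator within each list of names, to avoid collisions.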
;with
c as(
select c.*, g.grp
from @cli c
join @grp g on g.id = c.UniqueID
),
d as (
select *, row_number() over (partition by g order by t,s) n1, row_number() over (partition by g order by t desc,s desc) n2
from (
select distinct c.grp g, 1 t, FirstName s from c
union
select distinct c.grp, 2, LastName from c
) l
),
r as (
select d.*, cast(CONVERT(VARCHAR(10), t.Birthday, 112) + '_' + s as varchar(1000)) Names, cast(0 as bigint) i1, cast(0 as bigint) i2
from d
join @cli t on t.UniqueID=d.g
where n1=1
union all
select d.*, cast(r.names + IIF(r.t<>d.t,'_',',') + d.s as varchar(1000)), r.n1, r.n2
from d
join r on r.g = d.g and r.n1=d.n1-1
)
insert into @str_id
select g, Names
from r
where n2=1
Step five: output the results
select c.UniqueID, case when id2=UniqueID then id1 else id2 end PossibleMatchingContract, s.SyntheticID
from @cli c
left join @comb cb on c.UniqueID in(id1,id2) and id1<>id2
left join @grp g on c.UniqueID = g.id
left join @str_id s on s.grp = g.grp
Here are the results:
UniqueID PossibleMatchingContract SyntheticID
1 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
1 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
2 3 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 1 1980-11-04_Peter_Gray,Gray-Smith,Smith
3 2 1980-11-04_Peter_Gray,Gray-Smith,Smith
4 5 1985-06-09_Frank,Frank-Paul_May
5 4 1985-06-09_Frank,Frank-Paul_May
6 NULL 1950-11-04_Gina_Ericson
I think a SyntheticID generated this way should also be "unique" for each group.
Answer 1 (score: 1)
This creates a synthetic value and is easy to change to suit your needs.
DECLARE @T TABLE (
UniqueID INT
,FirstName VARCHAR(200)
,LastName VARCHAR(200)
,Birthday DATE
)
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 1,'Peter','Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 2,'Peter','Gray','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 3,'Peter','Gray-Smith','1980-11-04'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 4,'Frank','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 5,'Frank-Paul','May','1985-06-09'
INSERT INTO @T(UniqueID,FirstName,LastName,Birthday) SELECT 6,'Gina','Ericson','1950-11-04'
DECLARE @PossibleMatches TABLE (UniqueID INT,[PossibleMatch] INT,SynKey VARCHAR(2000)
)
INSERT INTO @PossibleMatches
SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.FirstName=t2.FirstName
AND t1.LastName=t2.LastName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID [UniqueID],t2.UniqueID [Possible Matches],'Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.FirstName=t2.FirstName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID,t2.UniqueID,'Ln=' + t1.LastName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
INNER JOIN @T t2 ON t1.Birthday=t2.Birthday
AND t1.LastName=t2.LastName
AND t1.UniqueID<>t2.UniqueID
INSERT INTO @PossibleMatches
SELECT t1.UniqueID,pm.UniqueID,'Ln=' + t1.LastName + ' Fn=' + t1.FirstName + ' DoB=' + CONVERT(VARCHAR,t1.Birthday,102) [SynKey]
FROM @T t1
LEFT JOIN @PossibleMatches pm on pm.UniqueID=t1.UniqueID
WHERE pm.UniqueID IS NULL
SELECT *
FROM @PossibleMatches
ORDER BY UniqueID,[PossibleMatch]
Answer 2 (score: 1)
I think this will work for you:
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday)
OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM
[dbo].AllContracts AS C INNER JOIN
[dbo].AllContracts AS CC ON
C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE
C.UniqueID NOT IN(
SELECT UniqueID FROM [dbo].DefinitiveMatches)
AND C.AssociatedUser IS NULL
Answer 3 (score: 0)
You can try this:
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
FIRST_VALUE(CC.FirstName+CC.LastName+CC.Birthday)
OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) as SyntheticID
FROM
[dbo].AllContracts AS C
INNER JOIN
[dbo].AllContracts AS CC
ON
C.SecondaryMatchCodeFB = CC.SecondaryMatchCodeFB
OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeLB
OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeBB
OR
C.SecondaryMatchCodeLB = CC.SecondaryMatchCodeBB
OR
C.SecondaryMatchCodeBB = CC.SecondaryMatchCodeLB
WHERE
C.UniqueID NOT IN
(
SELECT UniqueID FROM [dbo].DefinitiveMatches
)
AND
C.AssociatedUser IS NULL
This produces an extra row (because we left out C.UniqueID <> CC.UniqueID), but it will give you a good solution.
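A sketch of one way to drop that extra row again without disturbing the SyntheticID, assuming SQL Server 2012 or later (for FIRST_VALUE). The @c table is a hypothetical stand-in for dbo.AllContracts, and the name/birthday join is an assumption replacing the SecondaryMatchCode columns. The self-match stays inside the window so that every member of a group sees the same first row, and is filtered out only afterwards.

```sql
-- Hypothetical stand-in for dbo.AllContracts, using the question's sample rows.
DECLARE @c TABLE (UniqueID int PRIMARY KEY, FirstName varchar(20),
                  LastName varchar(20), Birthday date);
INSERT INTO @c VALUES
    (1, 'Peter',      'Smith',      '1980-11-04'),
    (2, 'Peter',      'Gray',       '1980-11-04'),
    (3, 'Peter',      'Gray-Smith', '1980-11-04'),
    (4, 'Frank',      'May',        '1985-06-09'),
    (5, 'Frank-Paul', 'May',        '1985-06-09'),
    (6, 'Gina',       'Ericson',    '1950-11-04');

SELECT m.UniqueID, m.PossiblyMatchingContracts, m.SyntheticID
FROM (
    SELECT C.UniqueID,
           CC.UniqueID AS PossiblyMatchingContracts,
           -- Style 23 formats the date as yyyy-mm-dd.
           FIRST_VALUE(CC.FirstName + CC.LastName + CONVERT(char(10), CC.Birthday, 23))
               OVER (PARTITION BY C.UniqueID ORDER BY CC.UniqueID) AS SyntheticID
    FROM @c AS C
    JOIN @c AS CC
        ON C.Birthday = CC.Birthday
       AND (C.FirstName = CC.FirstName OR C.LastName = CC.LastName)
) AS m
-- Remove the self-match only after the window function has been evaluated.
WHERE m.UniqueID <> m.PossiblyMatchingContracts
ORDER BY m.UniqueID, m.PossiblyMatchingContracts;
```

Applying the inequality inside the join instead would make contract 1 pick up contract 2's values and contract 2 pick up contract 1's, so the group would no longer share a single SyntheticID. Note that this sketch drops contracts without any match (contract 6 here), and it only yields one ID per group as long as every member matches the group's lowest UniqueID directly.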
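Answer 4 (score: 0)
Below is an example with some sample data taken from the original post. The idea: generate all SyntheticIDs in a CTE, query all records that have a "PossibleMatch", and union them with all records not yet included: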
DECLARE @t TABLE(
UniqueID int
,FirstName nvarchar(20)
,LastName nvarchar(20)
,Birthday datetime
)
INSERT INTO @t VALUES (1, 'Peter', 'Smith', '1980-11-04');
INSERT INTO @t VALUES (2, 'Peter', 'Gray', '1980-11-04');
INSERT INTO @t VALUES (3, 'Peter', 'Gray-Smith', '1980-11-04');
INSERT INTO @t VALUES (4, 'Frank', 'May', '1985-06-09');
INSERT INTO @t VALUES (5, 'Frank-Paul', 'May', '1985-06-09');
INSERT INTO @t VALUES (6, 'Gina', 'Ericson', '1950-11-04');
WITH ctePrep AS(
SELECT UniqueID, FirstName, LastName, BirthDay,
ROW_NUMBER() OVER (PARTITION BY FirstName, BirthDay ORDER BY FirstName, BirthDay) AS k,
FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
FROM @t
),
cteKeys AS(
SELECT FirstName, BirthDay, SyntheticID
FROM ctePrep
WHERE k = 1
),
cteFiltered AS(
SELECT
C.UniqueID,
CC.UniqueID AS PossiblyMatchingContracts,
keys.SyntheticID
FROM @t AS C
JOIN @t AS CC ON C.FirstName = CC.FirstName
AND C.Birthday = CC.Birthday
JOIN cteKeys AS keys ON keys.FirstName = c.FirstName
AND keys.Birthday = c.Birthday
WHERE C.UniqueID <> CC.UniqueID
)
SELECT UniqueID, PossiblyMatchingContracts, SyntheticID
FROM cteFiltered
UNION ALL
SELECT UniqueID, NULL, FirstName+LastName+CONVERT(nvarchar(10), Birthday, 126) AS SyntheticID
FROM @t
WHERE UniqueID NOT IN (SELECT UniqueID FROM cteFiltered)
Hope this helps. The result looks good to me:
UniqueID PossiblyMatchingContracts SyntheticID
---------------------------------------------------------------
2 1 PeterSmith1980-11-04
3 1 PeterSmith1980-11-04
1 2 PeterSmith1980-11-04
3 2 PeterSmith1980-11-04
1 3 PeterSmith1980-11-04
2 3 PeterSmith1980-11-04
4 NULL FrankMay1985-06-09
5 NULL Frank-PaulMay1985-06-09
6 NULL GinaEricson1950-11-04
Answer 5 (score: 0)
Tested in SSMS, it works perfectly. :)
--create table structure
create table #temp
(
uniqueID int,
firstname varchar(15),
lastname varchar(15),
birthday date
)
--insert data into the table
insert #temp
select 1, 'peter','smith','1980-11-04'
union all
select 2, 'peter','gray','1980-11-04'
union all
select 3, 'peter','gray-smith','1980-11-04'
union all
select 4, 'frank','may','1985-06-09'
union all
select 5, 'frank-paul','may','1985-06-09'
union all
select 6, 'gina','ericson','1950-11-04'
select * from #temp
--solution is as below
select ab.uniqueID
, PossiblyMatchingContracts
, c.firstname+c.lastname+cast(c.birthday as varchar) as synID
from
(
select a.uniqueID
, case
when a.uniqueID < min(b.uniqueID)over(partition by a.uniqueid)
then a.uniqueID
else min(b.uniqueID)over(partition by a.uniqueid)
end as SmallestID
, b.uniqueID as PossiblyMatchingContracts
from #temp a
left join #temp b
on (a.firstname = b.firstname OR a.lastname = b.lastname) AND a.birthday = b.birthday AND a.uniqueid <> b.uniqueID
) as ab
left join #temp c
on ab.SmallestID = c.uniqueID
The result is captured below:
Answer 6 (score: 0)
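Suppose we have the following table (a VIEW in your case):
UniqueID PossiblyMatchingContracts SyntheticID
1 2 G1
1 3 G2
2 1 G3
2 3 G4
3 1 G4
3 4 G6
4 5 G7
5 4 G8
6 NULL G9
In your case, you can set the initial SyntheticID of each row to a string such as PeterSmith1980-11-04, built from the row identified by UniqueID. The query for this is a recursive CTE that divides all rows into unconnected groups and selects MAX(SyntheticID) within the current group as the new SyntheticID for all rows in that group.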
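A minimal sketch of such a recursive CTE, not the answer's original query. It assumes a hypothetical @v table holding the rows listed above, treats every match as bidirectional, and tracks the visited IDs in a path string so the recursion cannot loop forever on cycles.

```sql
-- Hypothetical stand-in for the view, filled with the rows from the table above.
DECLARE @v TABLE (UniqueID int, PossiblyMatchingContracts int, SyntheticID varchar(50));
INSERT INTO @v VALUES
    (1, 2, 'G1'), (1, 3, 'G2'), (2, 1, 'G3'), (2, 3, 'G4'), (3, 1, 'G4'),
    (3, 4, 'G6'), (4, 5, 'G7'), (5, 4, 'G8'), (6, NULL, 'G9');

WITH edges AS (
    -- Read every match in both directions.
    SELECT UniqueID AS a, PossiblyMatchingContracts AS b
    FROM @v WHERE PossiblyMatchingContracts IS NOT NULL
    UNION
    SELECT PossiblyMatchingContracts, UniqueID
    FROM @v WHERE PossiblyMatchingContracts IS NOT NULL
),
walk AS (
    -- Anchor: every contract reaches itself.
    SELECT s.UniqueID AS StartID, s.UniqueID AS CurrentID,
           CAST('/' + CAST(s.UniqueID AS varchar(10)) + '/' AS varchar(4000)) AS Path
    FROM (SELECT DISTINCT UniqueID FROM @v) AS s
    UNION ALL
    -- Recursive step: follow an edge to a contract not yet on the path.
    SELECT w.StartID, e.b,
           CAST(w.Path + CAST(e.b AS varchar(10)) + '/' AS varchar(4000))
    FROM walk AS w
    JOIN edges AS e ON e.a = w.CurrentID
    WHERE w.Path NOT LIKE '%/' + CAST(e.b AS varchar(10)) + '/%'
),
grp AS (
    -- Largest initial SyntheticID among all contracts reachable from StartID.
    SELECT w.StartID, MAX(v.SyntheticID) AS GroupSyntheticID
    FROM walk AS w
    JOIN @v AS v ON v.UniqueID = w.CurrentID
    GROUP BY w.StartID
)
SELECT v.UniqueID, v.PossiblyMatchingContracts, g.GroupSyntheticID AS SyntheticID
FROM @v AS v
JOIN grp AS g ON g.StartID = v.UniqueID
ORDER BY v.UniqueID, v.PossiblyMatchingContracts;
```

On this sample, contracts 1 to 5 are connected through the 3-to-4 match and should all receive G8, while contract 6 keeps G9. The walk enumerates every simple path, so it is only practical for small groups; for large or densely connected data, an iterative relabelling loop is likely the safer choice.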