SQL Server record linkage after string matching

Date: 2016-12-22 11:51:21

Tags: sql sql-server similarity

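Background - I have a set of customer data and have used a string-matching algorithm to compare all records for similarity. I now need to group the results that are related to each other, either directly or by association, and apply a unique ID to each group.

Problem - I can't work out a way to link the records together and apply a unique ID to each group.

Example

For the matches that have been found, the data currently looks like this (MatchScore isn't relevant to the problem here, but it shows where the data comes from).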

+-------------+-------------+------------+
| CustomerID1 | CustomerID2 | MatchScore |
+-------------+-------------+------------+
|     2021000 |     2707799 | 0.075      |
|     2021000 |     3856308 | 0.082      |
|      774062 |      774063 | 0.041      |
|      998328 |     2278386 | 0.063      |
|      998328 |      998329 | 0.058      |
|      998329 |     2278386 | 0.030      |
+-------------+-------------+------------+

The bottom 3 records are all linked, so I would like them to have the same ID.

[Image: diagram showing that these records are all related]

This is how I would like the data to look:

+----+-------------+-------------+------------+
| ID | CustomerID1 | CustomerID2 | MatchScore |
+----+-------------+-------------+------------+
|  1 |      998328 |     2278386 | 0.063      |
|  1 |      998328 |      998329 | 0.058      |
|  1 |      998329 |     2278386 | 0.030      |
|  2 |     2021000 |     2707799 | 0.075      |
|  2 |     2021000 |     3856308 | 0.082      |
|  3 |      774062 |      774063 | 0.041      |
+----+-------------+-------------+------------+

or, similarly:

+----+------------+
| ID | CustomerID |
+----+------------+
|  1 |    2278386 |
|  1 |     998328 |
|  1 |     998329 |
|  2 |    2021000 |
|  2 |    2707799 |
|  2 |    3856308 |
|  3 |     774062 |
|  3 |     774063 |
+----+------------+

Code to generate the sample table:

select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
into #tmp
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore

select * from #tmp

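As I said, I can't think of how to link the records together. I have tried all sorts of joins, but the eureka moment never comes. Please help.

Thanks

1 answer:

Answer 0 (score: 1)

I'm not sure whether this is the result you expect: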

with tmp as(
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774062' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore)


select DENSE_RANK() OVER(ORDER BY rank_value) id, t1.CustomerID1, t1.CustomerID2
from(
    select 
        t1.*, 
        case 
            when t2.CustomerID1 IS NOT NULL 
                THEN t2.CustomerID1 
            ELSE t3.CustomerID1 
        end rank_value

    from tmp t1
    left join tmp t2 
    on (t1.CustomerID1 = t2.CustomerID2 
            and t1.CustomerID2!=t2.CustomerID1 
            and (t1.CustomerID1 != t1.CustomerID2 and t2.CustomerID1 != t2.CustomerID2))
       or (t1.CustomerID1 = t2.CustomerID1 
             and t1.CustomerID2 != t2.CustomerID2 
             and (t1.CustomerID1 != t1.CustomerID2)) 
    left join tmp t3 
        on t1.CustomerID1 = t3.CustomerID2 
            and t1.CustomerID2=t3.CustomerID1
)t1

I get the following result:

[Image: result set returned by the query]

Note: the DENSE_RANK() function is available in SQL Server 2012 (it has existed since SQL Server 2005).
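
If the links can chain through more than one intermediate record, the grouping is the connected-components problem on a graph whose edges are the matched pairs. Here is a minimal sketch of one way to approach it in T-SQL: repeatedly push the smallest ID across every edge until nothing changes. It assumes the #tmp table from the question. The names #edges, #nodes, GroupKey and @changed are made up for this illustration, and the IDs are cast to int on the assumption that they are always numeric.

```sql
-- Sketch only: assumes #tmp(CustomerID1, CustomerID2, MatchScore) from the question.
-- #edges, #nodes, GroupKey and @changed are illustrative names.

-- Store every match in both directions so a link can be followed either way.
SELECT CAST(CustomerID1 AS int) AS CustomerID,
       CAST(CustomerID2 AS int) AS Neighbour
INTO #edges
FROM #tmp
UNION
SELECT CAST(CustomerID2 AS int), CAST(CustomerID1 AS int)
FROM #tmp;

-- Each customer starts in a group of its own, keyed by its own ID.
SELECT DISTINCT CustomerID, CustomerID AS GroupKey
INTO #nodes
FROM #edges;

DECLARE @changed int = 1;

-- Keep lowering each customer's key to the smallest key among its neighbours.
-- When a pass updates no rows, every linked set shares one key.
WHILE @changed > 0
BEGIN
    UPDATE n
    SET GroupKey = m.MinKey
    FROM #nodes AS n
    JOIN (
        SELECT e.CustomerID, MIN(nb.GroupKey) AS MinKey
        FROM #edges AS e
        JOIN #nodes AS nb ON nb.CustomerID = e.Neighbour
        GROUP BY e.CustomerID
    ) AS m ON m.CustomerID = n.CustomerID
    WHERE m.MinKey < n.GroupKey;

    SET @changed = @@ROWCOUNT;
END;

-- Turn the shared keys into consecutive group IDs.
SELECT DENSE_RANK() OVER (ORDER BY GroupKey) AS ID, CustomerID
FROM #nodes
ORDER BY ID, CustomerID;
```

This returns the (ID, CustomerID) shape from the question. The ID numbers follow the smallest CustomerID in each group, so they need not match the numbering in the example. Joining #nodes back to #tmp on CustomerID1 should give the other shape, with one ID per matched pair.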