Question

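I have a table of customers that come from different data sources. There are SSNs, license numbers, and some unique system IDs, but not every source has the same IDs. I want to compare records on the ID columns (SSN, License, SystemID) and assign a mapping ID whenever I find the same person.

I assume I could use a CTE, but I don't know where to start. I'm still learning my way around SQL. Any help would be greatly appreciated. Thanks.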

The table looks like this:

Source|RowID|SSN |License|SystemID
A     |1    |SSN1|Lic111 |
A     |2    |    |       |Sys666
B     |3    |SSN2|       |Sys777
C     |4    |SSN1|       |
D     |5    |    |Lic333 |
D     |6    |    |Lic333 |Sys666
E     |7    |    |       |Sys777

Result (with MapCustomerID added):

Source|RowID|SSN |License|SystemID|MapCustomerID
A     |1    |SSN1|Lic111 |        |1
A     |2    |    |       |Sys666  |2
B     |3    |SSN2|       |Sys777  |3
C     |4    |SSN1|       |        |1
D     |5    |    |Lic999 |        |4
D     |6    |    |Lic333 |Sys666  |2
E     |7    |    |       |Sys777  |3

Answer 1

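Here is an approach that may be "good enough" for this problem.

Along each of the three dimensions, find the minimum row ID for that dimension (with special handling for NULL). The overall customer identifier is then the smallest of these three IDs. To make the identifiers sequential with no gaps, use dense_rank().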

with ids as (
      -- For each identifier, the smallest RowId among rows sharing that value;
      -- NULL when the row has no value for that identifier.
      select t.*,
             (case when SSN is not null
                   then min(RowId) over (partition by SSN)
              end) as SSN_id,
             (case when License is not null
                   then min(RowId) over (partition by License)
              end) as License_id,
             (case when SystemId is not null
                   then min(RowId) over (partition by SystemId)
              end) as SystemId_id
      from t
     ),
     leastid as (
      -- Take the smallest of the three candidate ids, ignoring NULLs.
      select ids.*,
             (case when SSN_id <= coalesce(License_id, SSN_id) and
                        SSN_id <= coalesce(SystemId_id, SSN_id)
                   then SSN_id
                   when License_id <= coalesce(SystemId_id, License_id)
                   then License_id
                   else SystemId_id
              end) as LeastId
      from ids
     )
-- Renumber LeastId as consecutive ids 1, 2, 3, ... with no gaps.
select Source, RowID, SSN, License, SystemID,
       dense_rank() over (order by LeastId) as MapCustomerId
from leastid;

This is not a complete solution, but it works for your data. It does not work in a situation like this:

A     |1    |SSN1|Lic111 |        |1
A     |2    |SSN1|       |Sys666  |2
A     |3    |    |       |Sys666  |2

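because linking row 3 to row 1 requires two "hops": row 3 shares Sys666 with row 2, and row 2 shares SSN1 with row 1.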

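When I have run into this in the past, I created extra columns in the table and repeatedly used update to get the minimum ID along the different dimensions. This kind of iteration connects the different pieces quickly. It is probably possible to write a recursive CTE that does the same thing. However, the simpler solution above may solve your problem.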
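A rough sketch of what such an iterative update could look like, not the exact script used in the past. The GroupId column, the sample rows, and the use of Python's sqlite3 module as a harness are all assumptions made for illustration; the update statement itself is plain SQL and would sit inside a loop in whatever procedural dialect your database offers.

```python
import sqlite3

# Hypothetical demo table: rows 1-3 form the "two hop" chain, row 4 is unrelated.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t (RowId integer primary key, SSN text, License text,
                    SystemId text, GroupId integer);
    insert into t (RowId, SSN, License, SystemId) values
        (1, 'SSN1', 'Lic111', NULL),
        (2, 'SSN1', NULL,     'Sys666'),
        (3, NULL,   NULL,     'Sys666'),
        (4, NULL,   'Lic444', NULL);
    -- Every row starts out as its own group.
    update t set GroupId = RowId;
""")

# One pass: lower each row's GroupId to the smallest GroupId among rows that
# share an SSN, License, or SystemId with it. Rows already at the minimum
# are left alone, so the number of changed rows tells us when to stop.
propagate = """
    update t
    set GroupId = (select min(t2.GroupId) from t t2
                   where t2.SSN = t.SSN
                      or t2.License = t.License
                      or t2.SystemId = t.SystemId)
    where GroupId > (select min(t2.GroupId) from t t2
                     where t2.SSN = t.SSN
                        or t2.License = t.License
                        or t2.SystemId = t.SystemId)
"""

# Repeat the pass until it no longer changes any row.
while conn.execute(propagate).rowcount > 0:
    pass

# Renumber the surviving GroupId values as gap-free customer ids.
result = conn.execute("""
    select RowId, dense_rank() over (order by GroupId) as MapCustomerId
    from t
    order by RowId
""").fetchall()
print(result)
```

Each pass only ever lowers GroupId, so the loop stops once every linked set of rows has settled on its smallest RowId; longer chains simply need more passes. On the sample rows, rows 1 to 3 should end up sharing one MapCustomerId and row 4 should get its own.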

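EDIT:

Because I have faced this problem before, I wanted to come up with a single-query solution (rather than iterative updates). This is possible using a recursive CTE. The following code seems to work: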

with t as (
    -- Sample data that includes multi-hop links between rows.
    select 'A' as source, 1 as RowId, 'SSN1' as SSN, 'Lic111' as License, 'ABC' as SystemId union all
    select 'A', 2, 'SSN1', NULL, 'Sys666' union all
    select 'A', 3, NULL, NULL, 'Sys666' union all
    select 'A', 4, NULL, 'Lic222', 'Sys666' union all
    select 'A', 5, NULL, 'Lic222', NULL union all
    select 'A', 6, NULL, 'Lic444', NULL
   ),
    first as (
      -- Smallest RowId among rows sharing any identifier with this row.
      select t.*,
             (select min(RowId)
              from t t2
              where t2.SSN = t.SSN or
                    t2.License = t.License or
                    t2.SystemId = t.SystemId
             ) as minrowid
      from t
   ),
   cte as (
    -- Follow the chain of minrowid values down to smaller and smaller rows.
    select rowid, minrowid
    from first
    union all
    select cte.rowid, first.minrowid
    from cte join
         first
         on cte.minrowid = first.rowid and
            cte.minrowid > first.minrowid
    ),
    lookup as (
      -- One row per RowId: the smallest id it reaches, renumbered without gaps.
      select rowid, min(minrowid) as minrowid,
             dense_rank() over (order by min(minrowid)) as MapCustomerId
      from cte
      group by rowid
    )

select t.*, lookup.MapCustomerId
from t join
     lookup
     on t.rowid = lookup.rowid;

SQL CTE: compare rows in the same table
