Question

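I have a list of organizations that I am trying to de-duplicate. Each organization can have up to three identifying numbers, held in three different fields. The tools I am currently using are an Oracle SQL database and SAP Data Services.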

Name  |ID1     |ID2     |ID3
------|--------|--------|--------
Org1  |1       |<null>  |2        
Org2  |<null>  |1       |<null>  
Org3  |2       |<null>  |<null>

All three of these organizations should be identified as a single organization.

First approach

My first approach to solving this was to break the table down into a sorted list of orgs and IDs.

Name  |ID        
------|-------
Org1  |1              
Org2  |1       
Org1  |2      
Org3  |2
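For readers who prefer code to tables, the unpivot step can be sketched in Python. This is only an illustration under assumptions: the `rows` list is a hypothetical in-memory copy of the sample table, `None` stands for `<null>`, and `unpivot` is a made-up helper name, not anything from SAP Data Services.

```python
# Hypothetical in-memory copy of the sample table; None stands for <null>.
rows = [
    ("Org1", 1, None, 2),
    ("Org2", None, 1, None),
    ("Org3", 2, None, None),
]


def unpivot(rows):
    """Turn (name, id1, id2, id3) rows into (name, id) pairs, skipping nulls."""
    pairs = [(name, i) for name, *ids in rows for i in ids if i is not None]
    # Sort by ID first, then by name, to match the ordering of the table above.
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]))


name_id = unpivot(rows)
print(name_id)
```

Note that the three ID columns are treated as one shared pool of values here, as in the example where Org1's ID1 matches Org2's ID2.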

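From there, I managed to use a combination of data transforms to produce a list of only the duplicated IDs: selecting just the list of possible IDs, sorting them, removing the unique IDs, and assigning each remaining ID a ROW_ID (shown as a letter for this example, even though it looks redundant).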

ID    |Duplicate_Group     
------|--------
1     |A              
2     |B
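A rough Python equivalent of that transform chain, assuming the (name, id) pairs are already available as a list. The extra `Org4` row and the `duplicate_groups` helper are hypothetical, added only to show a unique ID being dropped; letters are used purely to mirror the table, so this sketch handles at most 26 groups.

```python
from collections import Counter
from string import ascii_uppercase

# Sample pairs from the question plus one hypothetical org with a unique ID.
name_id = [("Org1", 1), ("Org2", 1), ("Org1", 2), ("Org3", 2), ("Org4", 7)]


def duplicate_groups(name_id):
    """Map every ID shared by more than one (name, id) pair to a letter."""
    # set() guards against one org carrying the same ID in two columns.
    counts = Counter(i for _, i in set(name_id))
    shared = sorted(i for i, n in counts.items() if n > 1)
    # Letters only mirror the example; a real run would use a numeric ROW_ID.
    return {i: ascii_uppercase[k] for k, i in enumerate(shared)}


groups_by_id = duplicate_groups(name_id)
print(groups_by_id)
```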

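But if I join this back to my data, it does not solve my problem. It only causes further duplication. I had settled on this course as my solution before I realised that our organizations can each have multiple IDs: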

Name  |ID1     |ID2     |ID3     |Duplicate_Group
------|--------|--------|--------|--------
Org1  |1       |<null>  |2       |A 
Org1  |1       |<null>  |2       |B 
Org2  |<null>  |1       |<null>  |A 
Org3  |2       |<null>  |<null>  |B

New approach

My next idea is to assign letters in the same way to distinguish the groups among my organizations, but this time to loop over the data.

Name  |ID      |Duplicate_Group
------|--------|--------
Org1  |1       |A 
Org1  |2       |B 
Org2  |1       |A 
Org3  |2       |B

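First, sorting by Name and then ID, I would check whether either the Name or the ID is the same as in the previous row, and if so take that row's duplicate group as the current row's own.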

Name  |ID      |D_Grp1  |D_Grp2
------|--------|--------|--------
Org1  |1       |A       | 
Org1  |2       |B       |A
Org2  |1       |A       | 
Org3  |2       |B       |

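Notice that Org1's D_Grp2 has changed in the second row. Now I merge the old D_Grp1 into D_Grp2, sort again (this time by ID, then Name), and once more update the group based on the previous row.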

Name  |ID      |D_Grp2  |D_Grp3
------|--------|--------|--------
Org1  |1       |A       | 
Org2  |1       |A       | 
Org1  |2       |A       | 
Org3  |2       |B       |A

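Since the fourth row has the same ID as the row above it but a different D_Grp2, the fourth row updates its D_Grp3 to match. My idea is to loop this sorting process over and over, alternating between ID and Name, until there are no more changes. I would use some kind of column or variable as a flag: after each loop, if no rows are flagged as changed, I assume everything has been merged. I would then apply a distinct and join the result back to the original table.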

Name  |ID1     |ID2     |ID3     |Duplicate_Group
------|--------|--------|--------|--------
Org1  |1       |<null>  |2       |A 
Org2  |<null>  |1       |<null>  |A 
Org3  |2       |<null>  |<null>  |A
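The loop described above can be sketched in Python to see how it converges. This is a simplified variant, not the exact sort-and-compare-previous-row procedure: on each pass it takes the smallest label per Name and then per ID, and the starting label of a row is simply its ID rather than a letter. The function name `propagate` is an assumption made for the example.

```python
def propagate(name_id):
    """Repeat 'take the smallest label per Name, then per ID' until stable."""
    # Start with every (name, id) row labelled by its own ID.
    label = {pair: pair[1] for pair in name_id}
    changed = True
    while changed:  # plays the role of the "did anything change?" flag
        changed = False
        for key_index in (0, 1):  # 0 = group rows by Name, 1 = group rows by ID
            best = {}
            for pair, lab in label.items():
                key = pair[key_index]
                best[key] = min(best.get(key, lab), lab)
            for pair in label:
                new = best[pair[key_index]]
                if new != label[pair]:
                    label[pair] = new
                    changed = True
    # After convergence all rows of one Name share a label.
    return {name: lab for (name, _), lab in label.items()}


groups = propagate([("Org1", 1), ("Org2", 1), ("Org1", 2), ("Org3", 2)])
print(groups)
```

In this variant labels can only decrease, so the loop always terminates, and a row can only stop changing once it agrees with every row sharing its Name or its ID.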

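Scope-wise I have about 60,000 organizations, so looping is not out of the question, but it would take a long time, and I can't seem to design a better process. I am also not sure whether I have missed any edge cases in which sorting by Name and ID would never merge some records.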

In conclusion

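So, StackOverflow, is there a better way to identify the duplicates in this table? Any kind of answer is welcome, including SQL. Please understand that the logic in SAP Data Services is basically SQL, but I do not write my own SQL directly, so I cannot provide a SQL version of my process so far.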

Answer 1

Do a self-JOIN on the three fields.

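-- Oracle does not accept AS before a table alias, so the aliases are written bare.
-- With INNER JOINs a row is returned only when all three matches exist;
-- LEFT JOINs would keep partial matches.
SELECT
   o.Name        AS org_name,
   match12.ID2   AS matched_id2,
   match12.Name  AS name_matched_on_id2,
   match23.ID3   AS matched_id3,
   match23.Name  AS name_matched_on_id3,
   match31.ID1   AS matched_id1,
   match31.Name  AS name_matched_on_id1
FROM organisations o
INNER JOIN organisations match12 ON o.ID1 = match12.ID2
INNER JOIN organisations match23 ON o.ID2 = match23.ID3
INNER JOIN organisations match31 ON o.ID3 = match31.ID1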

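This should give you a list of all organisations that match on different ID columns.

Note that self-JOINs can be very expensive, so this query may take a while on large datasets.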

Answer 2

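This is definitely an iterative process, so it calls for a recursive query.

You have to find chains. If ID 1 = 2 and 2 = 3 and 4 = 5 and 5 = 6 and 2 = 5, then all of these IDs mean the same company.

Here is my algorithm: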

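1. Find all pairs: 1 = 2, 2 = 1, 2 = 3, ...
2. Find all chains (for ID 1: 1 = 2 = 3, 1 = 2 = 5; for ID 2: 2 = 1, 2 = 3, 2 = 5; for ID 3: 3 = 2 = 1, 3 = 2 = 5; ...) and give every ID in a chain the calling ID as its chain number. (So ID 3 gets chain number 1 from the first chain, 2 from the fourth chain, and 3 from the sixth and seventh chains.)
3. Build groups by associating each ID with its smallest chain number. (So the chain number for ID 3 is 1; ID 3 belongs to the "chain = 1" group.) All IDs with the same minimum chain number represent the same company.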
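To make the three steps concrete, here is a small Python sketch run on the example equalities given earlier (1 = 2, 2 = 3, 4 = 5, 5 = 6, 2 = 5). It is an illustration only: the name `min_chain_groups` is invented for the example, and a breadth-first walk stands in for the chain enumeration.

```python
from collections import defaultdict, deque

# The example equalities from the text: 1=2, 2=3, 4=5, 5=6, 2=5.
equalities = [(1, 2), (2, 3), (4, 5), (5, 6), (2, 5)]


def min_chain_groups(equalities):
    # Step 1: store every pair in both directions.
    neighbours = defaultdict(set)
    for a, b in equalities:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Step 2: walk the chains from each calling ID; every ID reached
    # records the calling ID as one of its chain numbers.
    chain_numbers = defaultdict(set)
    for start in list(neighbours):
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            chain_numbers[current].add(start)
            for nxt in neighbours[current] - seen:
                seen.add(nxt)
                queue.append(nxt)

    # Step 3: the group of an ID is its smallest chain number.
    return {i: min(numbers) for i, numbers in chain_numbers.items()}


chain_groups = min_chain_groups(equalities)
print(chain_groups)
```

All six IDs end up with group 1, which is the "same company" conclusion stated for this example.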

Here is the query:

with pairs as
(
  select id1 as id, id2 as other from mytable where id1 <> id2
  union all
  select id1 as id, id3 as other from mytable where id1 <> id3
  union all
  select id2 as id, id1 as other from mytable where id2 <> id1
  union all
  select id2 as id, id3 as other from mytable where id2 <> id3
  union all
  select id3 as id, id1 as other from mytable where id3 <> id1
  union all
  select id3 as id, id2 as other from mytable where id3 <> id2
)
, chains(chain, id) as
(
  select id as chain, id from pairs
  union all
  select c.chain, p.other as id
  from chains c
  join pairs p on p.id = c.id
)
cycle chain, id set cycle to 1 default 0
, groups as
(
  select id, min(chain) as grp 
  from chains 
  group by id
)
select distinct g.grp, m.*
from groups g
join mytable m on g.id in (m.id1, m.id2, m.id3)
order by g.grp, m.name;

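This query will be very slow when you have many IDs that mean the same company (i.e. many chains to evaluate). If there are only a few such occurrences, which seems more likely, the query will be quite fast. Give it a try :-)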
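If the data can be pulled out of the database, the same grouping can also be computed outside SQL. The sketch below swaps in a different technique, a union-find (disjoint-set) structure, rather than the recursive query. The `rows` layout, the extra `Org4` record and the function names are assumptions made for illustration, and all three ID columns are treated as one shared pool of values, as in the question's example.

```python
def find(parent, x):
    """Return the representative of x, shortening the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_records(rows):
    """Map each org name to the smallest ID in its connected group."""
    parent = {}
    for _, *ids in rows:
        present = [i for i in ids if i is not None]
        for i in present:
            parent.setdefault(i, i)
        # All IDs on one record belong to the same organization.
        for other in present[1:]:
            ra, rb = find(parent, present[0]), find(parent, other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # keep the smaller ID as root
    result = {}
    for name, *ids in rows:
        present = [i for i in ids if i is not None]
        if present:  # records without any ID cannot be grouped
            result[name] = find(parent, present[0])
    return result


# Sample table from the question plus one hypothetical unrelated org.
rows = [
    ("Org1", 1, None, 2),
    ("Org2", None, 1, None),
    ("Org3", 2, None, None),
    ("Org4", 9, None, None),
]
record_groups = group_records(rows)
print(record_groups)
```

Each record is visited a constant number of times, so the cost does not depend on how long the chains are.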

How should I identify duplicate groups when each record may have multiple identities?
