Finding duplicates and variants in MSSQL

Time: 2015-02-19 17:55:34

Tags: sql-server

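I have a table with a large amount of data. The ID column in that table is not unique, so the same ID can appear in more than one row. I found those duplicates with this query: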

SELECT theid FROM thetable
GROUP BY theid
HAVING COUNT(*) > 1

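The table also has columns such as street1, street2, city1 and city2.

For the IDs returned by that first query, I need to check whether street1 differs from street2, and whether city1 differs from city2, within any of the duplicates for a given ID. Does that make sense?

So suppose two rows have the same ID: for those, I need to check whether street1 differs between any of the rows that share that ID.

Any tips or pointers on how to do this would be welcome. I have been staring blindly at this problem and can't seem to find the right query.

Many thanks.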

2 answers:

Answer 0: (score: 0)

Using a CTE will help:

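;WITH CTE AS
(
  SELECT theID,
         Street1,
         Street2,
         Street3,
         City,
         State,
         Zip,
         -- Number the rows within each theID. ORDER BY theID is arbitrary inside
         -- a partition; order by a date or sequence column if one exists.
         rn = ROW_NUMBER() OVER(PARTITION BY theID ORDER BY theID)
  FROM thetable
  -- add joins if necessary
)
SELECT oldestID = c1.theID,
       oldestStreet1 = c1.Street1,
       newestStreet1 = c2.Street1,
       newestID = c2.theID
FROM CTE c1
-- Pair each row with the next row of the same theID.
INNER JOIN CTE c2 ON c2.theID = c1.theID AND c2.rn = c1.rn + 1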

You can also add a CASE expression to show matches versus mismatches. This helps when manually identifying typos (1337 Test St. vs 1337 Test Street):

;WITH CTE AS
(
  SELECT theID,
         Street1,
         Street2,
         Street3,
         City,
         State,
         Zip,
         rn = ROW_NUMBER() OVER(PARTITION BY theID ORDER BY theID)
  FROM thetable
  -- add joins if necessary
)
SELECT oldestID = c1.theID,
       -- Show 'Match' when both rows agree, otherwise show each row's own value.
       oldestStreet1 = CASE WHEN c1.Street1 = c2.Street1 THEN 'Match' ELSE c1.Street1 END,
       newestStreet1 = CASE WHEN c1.Street1 = c2.Street1 THEN 'Match' ELSE c2.Street1 END,
       newestID = c2.theID
FROM CTE c1
-- Pair each row with the next row of the same theID.
INNER JOIN CTE c2 ON c2.theID = c1.theID AND c2.rn = c1.rn + 1

Or you can return only the non-matching items by adding the comparison to the INNER JOIN clause:

;WITH CTE AS
(
  SELECT theID,
         Street1,
         Street2,
         Street3,
         City,
         State,
         Zip,
         rn = ROW_NUMBER() OVER(PARTITION BY theID ORDER BY theID)
  FROM thetable
  -- add joins if necessary
)
SELECT oldestID = c1.theID,
       oldestStreet1 = c1.Street1,
       newestStreet1 = c2.Street1,
       newestID = c2.theID
FROM CTE c1
INNER JOIN CTE c2 ON c2.theID = c1.theID
                 AND c2.rn = c1.rn + 1
                 -- add as many of these as you need; <> never matches NULL values
                 AND c1.Street1 <> c2.Street1
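
The row-pair join compares each row only with its neighbour. If the goal is just to list the IDs whose rows disagree at all, an aggregate per ID is another option. This is a minimal sketch and not part of the original answer; it assumes the table and column names from the question (thetable, theid, street1, city1), and the output aliases are made up for illustration.

```sql
-- Sketch only: names are taken from the question, aliases are hypothetical.
SELECT theid,
       rowsForId       = COUNT(*),
       street1Variants = COUNT(DISTINCT street1),  -- NULLs are not counted
       city1Variants   = COUNT(DISTINCT city1)
FROM thetable
GROUP BY theid
HAVING COUNT(*) > 1                                -- only duplicated IDs
   AND (COUNT(DISTINCT street1) > 1 OR COUNT(DISTINCT city1) > 1)
```

Each returned row is one duplicated ID together with the number of distinct street1 and city1 values found for it. Because COUNT(DISTINCT ...) ignores NULL, an ID whose only difference is a missing value will not be flagged.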

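Note that these are exact matches. You could implement some static fuzzy logic, such as LEFT(Zip, 5), to compare only the first 5 digits of the ZIP code (in case some ZIP codes are ZIP+4 and some are not).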
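
As a rough illustration of the LEFT(Zip, 5) idea, the comparison could be placed in the join condition. This sketch assumes Zip is a character column and uses the table name thetable from the question; the output aliases firstZip and secondZip are made up for the example.

```sql
-- Sketch only: assumes Zip is stored as text (e.g. '12345' or '12345-6789').
;WITH CTE AS
(
  SELECT theID,
         Zip,
         rn = ROW_NUMBER() OVER(PARTITION BY theID ORDER BY theID)
  FROM thetable
)
SELECT c1.theID,
       firstZip  = c1.Zip,
       secondZip = c2.Zip
FROM CTE c1
INNER JOIN CTE c2
        ON c2.theID = c1.theID
       AND c2.rn = c1.rn + 1
       -- Compare only the 5-character prefix, so '12345' and '12345-6789'
       -- are treated as the same ZIP code and are not reported.
       AND LEFT(c1.Zip, 5) <> LEFT(c2.Zip, 5)
```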

Answer 1: (score: 0)

You can also analyze it this way:

;WITH CTE AS
(
  SELECT theID,
         Street1,
         Street2,
         Street3,
         City,
         State,
         Zip,
         rn = ROW_NUMBER() OVER(PARTITION BY theID ORDER BY theID)
  FROM thetable
  -- add joins if necessary
),
CTE1 AS
(
  -- rn2 numbers the rows that share the same theID and the same address values.
  -- Note: rn > 2 keeps only the third and later rows of each theID.
  SELECT *,
         rn2 = ROW_NUMBER() OVER(PARTITION BY theID, Street1, Street2, City, State, Zip
                                 ORDER BY theID)
  FROM CTE
  WHERE rn > 2
)
SELECT * FROM CTE1
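
The filter on rn drops the first two rows of every ID, which may not be what is wanted. The following variation is a sketch under the assumption that the intent is to look at every row of each duplicated ID; the idCount column is introduced here for illustration and is not part of the original answer, and thetable is the table name from the question.

```sql
-- Sketch only: idCount is a hypothetical helper column.
;WITH CTE AS
(
  SELECT theID, Street1, Street2, City, State, Zip,
         -- Number of rows that share this theID.
         idCount = COUNT(*) OVER(PARTITION BY theID)
  FROM thetable
),
CTE1 AS
(
  SELECT *,
         -- Restarts at 1 for every distinct address within a theID.
         rn2 = ROW_NUMBER() OVER(PARTITION BY theID, Street1, Street2, City, State, Zip
                                 ORDER BY theID)
  FROM CTE
  WHERE idCount > 1          -- keep only IDs that occur more than once
)
SELECT *
FROM CTE1
ORDER BY theID, Street1, rn2
```

In the result, rows with rn2 = 1 are the first occurrence of each distinct address for an ID, so an ID with more than one such row has address variants, while rows with rn2 > 1 are exact repeats.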