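I'm trying to identify duplicate contact records within an account and am looking for the best way to do it. There is an account table and a contact table. Below is the query I came up with; it gives me what I need, but I suspect there is a better or more efficient way to do this, so I'm looking for any feedback or suggestions. Thanks in advance!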
SELECT * FROM sysdba.CONTACT a WITH(NOLOCK)
WHERE EXISTS
(
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL FROM sysdba.CONTACT b WITH(NOLOCK)
GROUP BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
HAVING COUNT(*) > 1
AND a.ACCOUNTID = b.ACCOUNTID AND a.FIRSTNAME = b.FIRSTNAME AND a.LASTNAME = b.LASTNAME AND a.EMAIL = b.EMAIL
)
ORDER BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
Here is another way I could do it, but having to use DISTINCT looks ugly:
SELECT DISTINCT a.CONTACTID, a.FIRSTNAME, a.LASTNAME, a.EMAIL FROM sysdba.CONTACT a WITH(NOLOCK)
JOIN sysdba.CONTACT b WITH(NOLOCK)
ON a.ACCOUNTID = b.ACCOUNTID AND a.FIRSTNAME = b.FIRSTNAME AND a.LASTNAME = b.LASTNAME AND a.EMAIL = b.EMAIL AND a.CONTACTID != b.CONTACTID
ORDER BY a.CONTACTID, a.FIRSTNAME, a.LASTNAME, a.EMAIL
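When I compared the execution plans for the two, the first query came out at 37% of the batch cost and the second at 63%. That surprised me, because I had always assumed (apparently wrongly) that a join is faster than relying on a subquery in the WHERE clause.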
Answer 0 (score: 2)
When you are trying to identify duplicates, the usual approach is to use window functions such as COUNT() OVER (...) and ROW_NUMBER() OVER (...).
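The following query should return the groups of records where more than one CONTACTID exists for the same combination of ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL. In other words, it returns every record that has duplicates, together with those duplicates: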
;WITH cteCONTACT
AS (
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID,
CNT = COUNT(*) OVER (PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
FROM sysdba.CONTACT
)
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID
FROM cteCONTACT
WHERE CNT > 1;
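To make the result of the COUNT(*) OVER approach concrete, here is a minimal sketch against made-up data. The @Contact table variable, its column types, and the sample rows are assumptions for illustration only and do not reflect the real sysdba.CONTACT schema.

```sql
-- Hypothetical sample data; column types are assumed, not taken from sysdba.CONTACT.
DECLARE @Contact TABLE (
    CONTACTID INT PRIMARY KEY,
    ACCOUNTID INT,
    FIRSTNAME VARCHAR(50),
    LASTNAME  VARCHAR(50),
    EMAIL     VARCHAR(100)
);

INSERT INTO @Contact (CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
VALUES (1, 10, 'Ann', 'Lee', 'ann@example.com'),
       (2, 10, 'Ann', 'Lee', 'ann@example.com'),  -- duplicate of CONTACTID 1
       (3, 10, 'Bob', 'Ray', 'bob@example.com'),  -- unique within its account
       (4, 20, 'Ann', 'Lee', 'ann@example.com');  -- same person, different account

;WITH cteCONTACT AS (
    SELECT CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL,
           -- Number of rows sharing the same account, name and e-mail
           CNT = COUNT(*) OVER (PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
    FROM @Contact
)
SELECT CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
FROM cteCONTACT
WHERE CNT > 1
ORDER BY CONTACTID;
```

With this sample data, only CONTACTID 1 and 2 come back: row 3 has no match, and row 4 belongs to a different account, so it falls into its own partition.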
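The following query should return only the duplicates themselves, leaving out the first record of each group (the one with the lowest CONTACTID):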
;WITH cteCONTACT
AS (
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID,
NUM = ROW_NUMBER() OVER (
PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
ORDER BY CONTACTID)
FROM sysdba.CONTACT
)
SELECT ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL, CONTACTID
FROM cteCONTACT
WHERE NUM > 1;
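One difference from the join and EXISTS versions in the question is worth checking on your own data: PARTITION BY puts NULL values into the same group, whereas an equality test such as a.EMAIL = b.EMAIL does not match two NULLs under the default ANSI_NULLS setting. The sketch below illustrates this with a hypothetical @Contact table variable; the column types and sample rows are assumptions, not the real schema.

```sql
-- Hypothetical sample data; column types are assumed, not taken from sysdba.CONTACT.
DECLARE @Contact TABLE (
    CONTACTID INT PRIMARY KEY,
    ACCOUNTID INT,
    FIRSTNAME VARCHAR(50),
    LASTNAME  VARCHAR(50),
    EMAIL     VARCHAR(100) NULL
);

INSERT INTO @Contact (CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL)
VALUES (1, 10, 'Ann', 'Lee', NULL),
       (2, 10, 'Ann', 'Lee', NULL),               -- same key as 1, e-mail missing in both
       (3, 10, 'Bob', 'Ray', 'bob@example.com');  -- unique

;WITH cteCONTACT AS (
    SELECT CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL,
           -- Rows are numbered 1, 2, ... inside each group, lowest CONTACTID first
           NUM = ROW_NUMBER() OVER (
                     PARTITION BY ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
                     ORDER BY CONTACTID)
    FROM @Contact
)
SELECT CONTACTID, ACCOUNTID, FIRSTNAME, LASTNAME, EMAIL
FROM cteCONTACT
WHERE NUM > 1;   -- everything after the first row of a group
```

With this sample data the query returns CONTACTID 2, while a self-join on a.EMAIL = b.EMAIL would return nothing for these rows. Whether contacts with a missing e-mail should count as duplicates is a judgment call for your data; if they should not, adding a filter such as EMAIL IS NOT NULL inside the CTE is one option.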