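I have been trying to build a table with two columns, party_id and matched_party_id, to identify duplicates, as described below.
As you can see, [First_NM + Last_NM + Zip_CD] lets us identify some duplicate records. I want to match these and keep them under the umbrella of the most frequently occurring party_id.
PARTY_ID FIRST_NM LAST_NM ZIP_CD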
----------------------------------------
95678 JANE DOE 7075
12345 JOHN DOE 7000
10000 JOHN DOE 7075
10000 JOHN DOE 7075
95678 JOHN DOE 7075
95678 JOHN DOE 7075
95678 JOHN DOE 7075
88648 JOHN DOE 7075
88648 JOHN DOE 7075
23456 JOHN DOE 7075
95678 SAM DOE 7075
95678 SAM DOE 7075
Required output
Party_ID Matched_ID
-----------------------
95678 10000
95678 88648
95678 23456
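Within the duplicate partition we have identified four distinct party_ids (95678, 10000, 88648 and 23456), and 95678 occurs most often, so all the other party_ids need to be matched to that party_id.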
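To make that rule concrete, here is a minimal sketch (not part of the original post) that counts how often each party_id occurs within a [First_NM + Last_NM + Zip_CD] group and pairs the most frequent one with the others. It uses Python's built-in sqlite3 module only so that it can be run standalone. The single table `party`, the CTE names `freq` and `most_frequent`, and the tie-break on the smallest party_id are assumptions made for illustration; the real data lives in INDVDL and PARTY_ADDR.

```python
import sqlite3

# Sample rows copied from the question: (party_id, first_nm, last_nm, zip_cd).
rows = [
    (95678, "JANE", "DOE", 7075),
    (12345, "JOHN", "DOE", 7000),
    (10000, "JOHN", "DOE", 7075),
    (10000, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (88648, "JOHN", "DOE", 7075),
    (88648, "JOHN", "DOE", 7075),
    (23456, "JOHN", "DOE", 7075),
    (95678, "SAM", "DOE", 7075),
    (95678, "SAM", "DOE", 7075),
]

conn = sqlite3.connect(":memory:")
# Hypothetical single table standing in for the INDVDL / PARTY_ADDR join.
conn.execute(
    "CREATE TABLE party (party_id INTEGER, first_nm TEXT, last_nm TEXT, zip_cd INTEGER)"
)
conn.executemany("INSERT INTO party VALUES (?, ?, ?, ?)", rows)

matches = conn.execute(
    """
    WITH freq AS (
        -- how many times each party_id occurs in each name/zip group
        SELECT first_nm, last_nm, zip_cd, party_id, COUNT(*) AS n
        FROM party
        GROUP BY first_nm, last_nm, zip_cd, party_id
    ),
    most_frequent AS (
        -- the most frequent party_id per group; ties go to the smallest id (assumption)
        SELECT f.first_nm, f.last_nm, f.zip_cd, MIN(f.party_id) AS party_id
        FROM freq f
        WHERE f.n = (SELECT MAX(g.n)
                     FROM freq g
                     WHERE g.first_nm = f.first_nm
                       AND g.last_nm = f.last_nm
                       AND g.zip_cd = f.zip_cd)
        GROUP BY f.first_nm, f.last_nm, f.zip_cd
    )
    SELECT m.party_id, f.party_id AS matched_id
    FROM most_frequent m
    JOIN freq f
      ON f.first_nm = m.first_nm
     AND f.last_nm = m.last_nm
     AND f.zip_cd = m.zip_cd
    WHERE f.party_id <> m.party_id
    ORDER BY f.n DESC, f.party_id
    """
).fetchall()

# For the sample data this yields (95678, 10000), (95678, 88648), (95678, 23456).
for party_id, matched_id in matches:
    print(party_id, matched_id)
```

Groups that contain only one distinct party_id (JANE DOE, SAM DOE and JOHN DOE at zip 7000) produce no rows, which agrees with the required output above.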
Here is the code I am using, but it does not return any NULL values for LEAD_PI.
SELECT MAX_PI AS PARTY_ID, LEAD_PI AS MATCHED_ID FROM
(SELECT DISTINCT B.FIRST_NM, B.MDDL_NM, B.LAST_NM, B.ZIP_CD, MAX_PI, LEAD_PI
FROM (SELECT I.PARTY_ID, I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD,
LEAD(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD
ORDER BY I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD) AS LEAD_PI,
MAX(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD) AS MAX_PI
FROM INDVDL I JOIN PARTY_ADDR A
ON I.PARTY_ID = A.PARTY_ID
) B
WHERE MIN_PI <> MAX_PI
AND MAX_PI <> nvl(LEAD_PI,0)
)
Answer 0: (score: 0)
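This would be somewhat simpler if your base table did not contain any fully duplicated rows (and I am not sure what the significance of such exact duplicates might be), but you can do it with the data as described. Something along these lines looks like it would do what you want: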
SELECT DISTINCT party_id, matched_id
FROM (
  SELECT
    -- largest party_id within each name/zip group
    MAX(i.party_id) OVER (
      PARTITION BY i.first_nm, i.mddl_nm, i.last_nm, a.zip_cd
    ) AS party_id,
    i.party_id AS matched_id
  FROM
    indvdl i
    JOIN party_addr a
      ON i.party_id = a.party_id
)
WHERE party_id != matched_id
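The inline view makes that query slightly simpler to express, but I think you could do without it. You can also get the desired result without using window functions: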
WITH nz AS (
  SELECT i.party_id, i.first_nm, i.last_nm, a.zip_cd
  FROM indvdl i
    JOIN party_addr a ON i.party_id = a.party_id
)
SELECT DISTINCT
  ref.party_id AS party_id,
  matched.party_id AS matched_id
FROM (
  SELECT MAX(party_id) AS party_id, first_nm, last_nm, zip_cd
  FROM nz
  GROUP BY first_nm, last_nm, zip_cd
  HAVING MAX(party_id) != MIN(party_id)
) ref
  JOIN nz matched
    ON ref.first_nm = matched.first_nm
    AND ref.last_nm = matched.last_nm
    AND ref.zip_cd = matched.zip_cd
WHERE ref.party_id != matched.party_id
Either way, I doubt that the LEAD() function does what you think it does, and I am sure it cannot do the job you need.
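As a rough illustration of that last point, the sketch below (an addition, not the answerer's code) shows what LEAD() returns for the four distinct party_ids of the JOHN DOE / 7075 group. It runs against Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or later, which is when window functions were added. The table `party` and the column alias `lead_pi` are hypothetical. Unlike the query in the question, it orders by party_id within the partition; ordering by the same columns that define the partition leaves the row order inside each partition unspecified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced table: one row per distinct party_id in the group.
conn.execute("CREATE TABLE party (party_id INTEGER, first_nm TEXT)")
conn.executemany(
    "INSERT INTO party VALUES (?, ?)",
    [(10000, "JOHN"), (95678, "JOHN"), (88648, "JOHN"), (23456, "JOHN")],
)

lead_rows = conn.execute(
    """
    SELECT party_id,
           -- value of party_id on the NEXT row of the same partition
           LEAD(party_id) OVER (PARTITION BY first_nm ORDER BY party_id) AS lead_pi
    FROM party
    ORDER BY party_id
    """
).fetchall()

for party_id, lead_pi in lead_rows:
    print(party_id, lead_pi)
```

Each row is paired only with its immediate successor, and the last row of the partition gets NULL (None in Python). LEAD() therefore never pairs one chosen party_id with every other member of the group, which is what the required output calls for.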