Use of partition and analytic functions in SQL

Time: 2015-03-24 18:07:23

Tags: sql oracle window-functions

I have been trying to create a table with two columns, party_id and matched_party_id, to identify duplicates, as described below:

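As you will see, the combination [First_NM + Last_NM + Zip_CD] lets us identify some duplicate records. I want to match these and keep them under the umbrella of the most frequently occurring party_id.
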
PARTY_ID  FIRST_NM  LAST_NM  ZIP_CD
-----------------------------------
95678     JANE      DOE      7075
12345     JOHN      DOE      7000
10000     JOHN      DOE      7075
10000     JOHN      DOE      7075
95678     JOHN      DOE      7075
95678     JOHN      DOE      7075
95678     JOHN      DOE      7075
88648     JOHN      DOE      7075
88648     JOHN      DOE      7075
23456     JOHN      DOE      7075
95678     SAM       DOE      7075
95678     SAM       DOE      7075

Required output

Party_ID  Matched_ID
--------------------
95678     10000
95678     88648
95678     23456

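Within the duplicate partition we have identified four distinct party_ids (95678, 10000, 88648, and 23456), and 95678 occurs most often, so all the other party_ids need to be matched to that party_id.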

Here is the code I am using. However, there are no null LEAD_PI values.

SELECT MAX_PI AS PARTY_ID, LEAD_PI AS MATCHED_ID from  
(SELECT DISTINCT B.FIRST_NM, B.MDDL_NM, B.LAST_NM, B.ZIP_CD,MAX_PI,LEAD_PI,  
FROM (SELECT I.PARTY_ID,I.FIRST_NM, I.MDDL_NM, I.LAST_NM,A.ZIP_CD,  
             LEAD(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD
                                    ORDER BY I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD) AS LEAD_PI,
             MAX(I.PARTY_ID) OVER (PARTITION BY I.FIRST_NM, I.MDDL_NM, I.LAST_NM, A.ZIP_CD) AS MAX_PI
      FROM INDVDL I JOIN PARTY_ADDR A  
      ON I.PARTY_ID = A.PARTY_ID  
     ) B  
WHERE MIN_PI <> MAX_PI  
AND MAX_PI <> nvl(LEAD_PI,0)

1 Answer:

Answer 0 (score: 0)

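This would be somewhat simpler if your base table did not contain any exact duplicate rows (and I am not sure what the significance of such exact duplicates might be), but you can do it with the data as described. Something along these lines looks like it would do what you want: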

SELECT DISTINCT party_id, matched_id
FROM (
    SELECT
      MAX(i.party_id) OVER (
        PARTITION BY i.first_nm, i.mddl_nm, i.last_nm, a.zip_cd
      ) AS party_id,
      i.party_id AS matched_id
    FROM
      indvdl i
      JOIN party_addr a  
        ON i.party_id = a.party_id
  )
WHERE party_id != matched_id
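
As a rough sanity check, the window-function query can be run against the sample data from the question. This is only a sketch under several assumptions: it uses SQLite (3.25 or later, for window functions) through Python's sqlite3 module rather than Oracle, it loads the rows into a single hypothetical table named joined_rows that stands in for the INDVDL / PARTY_ADDR join, and it leaves out the middle name because the sample data has none. Note also that MAX() picks the largest party_id in each group; in the sample data that happens to be 95678, which is also the most frequent one.

```python
import sqlite3

# Sample rows from the question: (party_id, first_nm, last_nm, zip_cd).
rows = [
    (95678, "JANE", "DOE", 7075),
    (12345, "JOHN", "DOE", 7000),
    (10000, "JOHN", "DOE", 7075),
    (10000, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (88648, "JOHN", "DOE", 7075),
    (88648, "JOHN", "DOE", 7075),
    (23456, "JOHN", "DOE", 7075),
    (95678, "SAM", "DOE", 7075),
    (95678, "SAM", "DOE", 7075),
]

conn = sqlite3.connect(":memory:")
# joined_rows is a hypothetical stand-in for "indvdl JOIN party_addr".
conn.execute(
    "CREATE TABLE joined_rows ("
    "party_id INTEGER, first_nm TEXT, last_nm TEXT, zip_cd INTEGER)"
)
conn.executemany("INSERT INTO joined_rows VALUES (?, ?, ?, ?)", rows)

# Same shape as the query above: the largest party_id in each
# (first_nm, last_nm, zip_cd) group is paired with every other id in it.
query = """
SELECT DISTINCT party_id, matched_id
FROM (
    SELECT
      MAX(j.party_id) OVER (
        PARTITION BY j.first_nm, j.last_nm, j.zip_cd
      ) AS party_id,
      j.party_id AS matched_id
    FROM joined_rows j
) AS v
WHERE party_id != matched_id
ORDER BY matched_id
"""
pairs = conn.execute(query).fetchall()
for party_id, matched_id in pairs:
    print(party_id, matched_id)
```

Only the JOHN / DOE / 7075 group contains more than one distinct party_id, so it is the only group that contributes rows.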

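The inline view makes the window-function query slightly simpler to express, but I think you could do without it. You can also get the desired result without using window functions: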

WITH nz AS (SELECT i.party_id, i.first_nm, i.last_nm, a.zip_cd
  FROM indvdl i JOIN party_addr a ON i.party_id = a.party_id)
SELECT DISTINCT
  ref.party_id AS party_id, 
  matched.party_id AS matched_id
FROM (
    SELECT MAX(party_id) AS party_id, first_nm, last_nm, zip_cd
    FROM nz
    GROUP BY first_nm, last_nm, zip_cd
    HAVING MAX(party_id) != MIN(party_id)
  ) ref
  JOIN nz matched
    ON ref.first_nm = matched.first_nm
      AND ref.last_nm = matched.last_nm
      AND ref.zip_cd = matched.zip_cd
WHERE ref.party_id != matched.party_id

Either way, I suspect the LEAD() function is not doing what you think it does, and I am sure it cannot do the job you need.
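
To illustrate that last point, here is a small sketch of what LEAD() returns for the JOHN / DOE / 7075 group. It rests on the same assumptions as any quick test outside Oracle: SQLite 3.25 or later via Python's sqlite3 module, and a hypothetical single table joined_rows in place of the real join. It also orders each partition by party_id so the result is deterministic. The query in the question orders by the partition columns themselves, so every row in a partition ties and the row order (and therefore each LEAD value) is arbitrary.

```python
import sqlite3

# The eight JOHN / DOE / 7075 rows from the question, plus one other row.
rows = [
    (12345, "JOHN", "DOE", 7000),
    (10000, "JOHN", "DOE", 7075),
    (10000, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (95678, "JOHN", "DOE", 7075),
    (88648, "JOHN", "DOE", 7075),
    (88648, "JOHN", "DOE", 7075),
    (23456, "JOHN", "DOE", 7075),
]

conn = sqlite3.connect(":memory:")
# joined_rows is a hypothetical stand-in for "indvdl JOIN party_addr".
conn.execute(
    "CREATE TABLE joined_rows ("
    "party_id INTEGER, first_nm TEXT, last_nm TEXT, zip_cd INTEGER)"
)
conn.executemany("INSERT INTO joined_rows VALUES (?, ?, ?, ?)", rows)

# LEAD() returns the value from the next row of the partition in the
# window's ORDER BY order, and NULL for the last row of the partition.
lead_rows = conn.execute(
    """
    SELECT party_id,
           LEAD(party_id) OVER (
             PARTITION BY first_nm, last_nm, zip_cd
             ORDER BY party_id
           ) AS lead_pi
    FROM joined_rows
    WHERE first_nm = 'JOHN' AND zip_cd = 7075
    """
).fetchall()

lead_values = [lead_pi for _, lead_pi in lead_rows]
for party_id, lead_pi in lead_rows:
    print(party_id, lead_pi)
```

Each row is paired only with its immediate successor, so most rows are never compared with the group's reference party_id, and exactly one row per partition gets a NULL.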