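My company keeps receiving duplicate records in our database, created within 4 minutes of the first record. Logically, a group of records consists of the original record plus any subsequent records created within that 4-minute window. The original record has a TO_DELETE value of 'N', and each duplicate has a TO_DELETE value of 'Y'. Each new group starts over with an 'N' value.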
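To make the grouping rule concrete, here is a minimal Python sketch of it, separate from the SQL. The function name `flag_duplicates` and the `WINDOW` constant are hypothetical, and it assumes a record created exactly 4 minutes after the group's first record still counts as a duplicate. The timestamps are the CREATEDDATE values from the sample data for a single person.

```python
from datetime import datetime, timedelta

# Assumption: "within 4 minutes" is inclusive of exactly 4 minutes.
WINDOW = timedelta(minutes=4)

def flag_duplicates(created_dates):
    """Return one TO_DELETE flag per timestamp, in ascending time order.

    'N' marks the first record of a group; 'Y' marks a record created
    within WINDOW of that group's first record.
    """
    flags = []
    anchor = None  # CREATEDDATE of the current group's first record
    for ts in sorted(created_dates):
        if anchor is None or ts - anchor > WINDOW:
            anchor = ts          # this record starts a new group
            flags.append("N")
        else:
            flags.append("Y")    # still inside the current group's window
    return flags

# CREATEDDATE values from the sample data (one partition).
sample = [
    datetime(2013, 9, 24, 0, 6, 1),
    datetime(2013, 9, 24, 0, 18, 47),
    datetime(2013, 9, 24, 0, 18, 50),
    datetime(2013, 9, 24, 0, 18, 52),
    datetime(2013, 9, 24, 0, 18, 52),
    datetime(2013, 9, 24, 0, 18, 54),
    datetime(2013, 9, 24, 0, 18, 55),
    datetime(2013, 9, 24, 0, 18, 56),
    datetime(2013, 9, 24, 0, 18, 56),
]
flags = flag_duplicates(sample)
print(flags)  # ['N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y']
```

The 00:06:01 record is a group of its own, the 00:18:47 record starts a second group, and the remaining seven records fall inside that second group's window.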
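With the help of Deleting Invalid Duplicate Rows in SQL, I have put together a query to select them, but it has been running for over 2 hours without returning a result set, so I'm not sure whether it is caught in an infinite loop. Any help figuring out what is wrong with it would be greatly appreciated!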
with LEAD_CTE as
(
select *, ROW_NUMBER() over (partition by LASTNAME, FIRSTNAME, EMAIL, PRIMARY_PHONE, PROGRAMX, TERM_CODE, INQ_TYPE, LEADSOURCE order by CREATEDDATE) as ROWNUMBER
from LEAD
where DELETE_FLAG <> 'Y'
and CREATEDDATE >= (GETDATE() - 7)
),
CTE as
(
select ROWNUMBER, 'N' as TO_DELETE, CREATEDDATE, 0 as TOTAL_MINUTES, LASTNAME, FIRSTNAME, EMAIL, PRIMARY_PHONE, PROGRAMX, TERM_CODE, INQ_TYPE, LEADSOURCE
from LEAD_CTE
where ROWNUMBER = 1
union all
select l.ROWNUMBER,
case when ((c.TOTAL_MINUTES + DATEDIFF(MINUTE, c.CREATEDDATE, l.CREATEDDATE)) > 4) then 'N' else 'Y' end as TO_DELETE,
l.CREATEDDATE,
case when ((c.TOTAL_MINUTES + DATEDIFF(MINUTE, c.CREATEDDATE, l.CREATEDDATE)) > 4) then 0 else (c.TOTAL_MINUTES + DATEDIFF(MINUTE, c.CREATEDDATE, l.CREATEDDATE)) end as TOTAL_MINUTES,
l.EMAIL, l.FIRSTNAME, l.LASTNAME, l.PRIMARY_PHONE, l.PROGRAMX, l.TERM_CODE, l.INQ_TYPE, l.LEADSOURCE
from LEAD_CTE l inner join CTE c on l.ROWNUMBER = (c.ROWNUMBER + 1)
)
select ROWNUMBER, TO_DELETE, CREATEDDATE, TOTAL_MINUTES, LASTNAME, FIRSTNAME, EMAIL, PRIMARY_PHONE, PROGRAMX, TERM_CODE, INQ_TYPE, LEADSOURCE
from CTE
order by LASTNAME, FIRSTNAME, EMAIL, PRIMARY_PHONE, PROGRAMX, TERM_CODE, INQ_TYPE, LEADSOURCE, CREATEDDATE
Sample data:
CREATEDDATE | LASTNAME | FIRSTNAME | EMAIL | PRIMARY_PHONE | PROGRAMX | TERM_CODE | INQ_TYPE | LEADSOURCE
---------------------------------------------------------------------------------------------------------------------------------------------
2013-09-24 00:06:01.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:47.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:50.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:52.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:52.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:54.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:55.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:56.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2013-09-24 00:18:56.000 | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
New CTE with a self-join:
with LEAD_CTE as
(
select *, ROW_NUMBER() over (partition by LASTNAME, FIRSTNAME, EMAIL, PRIMARY_PHONE, PROGRAMX, TERM_CODE, INQ_TYPE, LEADSOURCE order by CREATEDDATE) as ROWNUMBER
from LEAD
where DELETE_FLAG <> 'Y'
and CREATEDDATE >= (GETDATE() - 7)
)
select l1.ROWNUMBER, l1.CREATEDDATE, l2.CREATEDDATE, DATEDIFF(MINUTE, l1.CREATEDDATE, l2.CREATEDDATE), l1.LASTNAME, l1.FIRSTNAME, l1.EMAIL, l1.PRIMARY_PHONE, l1.PROGRAMX, l1.TERM_CODE, l1.INQ_TYPE, l1.LEADSOURCE
from LEAD_CTE l1 left join LEAD_CTE l2
on l1.ROWNUMBER = (l2.ROWNUMBER + 1)
and l1.LASTNAME = l2.LASTNAME
and l1.FIRSTNAME = l2.FIRSTNAME
and l1.EMAIL = l2.EMAIL
and l1.PRIMARY_PHONE = l2.PRIMARY_PHONE
and l1.PROGRAMX = l2.PROGRAMX
and l1.TERM_CODE = l2.TERM_CODE
and l1.INQ_TYPE = l2.INQ_TYPE
and l1.LEADSOURCE = l2.LEADSOURCE
order by l1.ROWNUMBER
Actual output:
ROWNUMBER | CREATEDDATE | CREATEDDATE | (no column name) | LASTNAME | FIRSTNAME | EMAIL | PRIMARY_PHONE | PROGRAMX | TERM_CODE | INQ_TYPE | LEADSOURCE
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 2013-09-24 00:06:01.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
2 | 2013-09-24 00:18:47.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
3 | 2013-09-24 00:18:50.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
4 | 2013-09-24 00:18:52.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
5 | 2013-09-24 00:18:52.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
6 | 2013-09-24 00:18:54.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
7 | 2013-09-24 00:18:55.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
8 | 2013-09-24 00:18:56.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
9 | 2013-09-24 00:18:56.000 | NULL | NULL | Testerson | Testy | test@test.com | (123) 867-5309 | MS in Higher Education | NULL | inquiry | Webform
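Interestingly, all of the l2 fields come back as NULL in every record, which I realize is why the DATEDIFF() calculation returns NULL. My expected output is for all l2 fields to have the values of the next l1 record, except for the last record's l2 fields, which would be NULL.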
Answer 0 (score: 1)
I think you're very close. You just need to add
CASE WHEN Datediff(minute, l2.createddate, l1.createddate ) > 4
OR l2.createddate is null
THEN 'Y' ELSE 'N' END,
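As I mentioned in the comments, you also need to deal with the fact that joining on nullable fields is a pain. TERM_CODE is NULL in your sample data, and under SQL Server's default ANSI_NULLS setting NULL = NULL does not evaluate to true, so the TERM_CODE condition rejects every pair of rows and all the l2 columns come back NULL. The join has to treat two NULL TERM_CODE values as a match: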
WITH lead_cte
AS (SELECT *,
Row_number()
OVER (
partition BY lastname, firstname, email, primary_phone,
programx,
term_code,
inq_type, leadsource
ORDER BY createddate) AS ROWNUMBER
FROM lead
WHERE delete_flag <> 'Y'
AND createddate >= ( Getdate() - 7 ))
SELECT l1.rownumber,
l1.createddate,
l2.createddate,
Datediff(minute, l2.createddate, l1.createddate ) ,
CASE WHEN Datediff(minute, l2.createddate, l1.createddate ) > 4
OR l2.createddate is null
THEN 'Y' ELSE 'N' END,
l1.lastname,
l1.firstname,
l1.email,
l1.primary_phone,
l1.programx,
l1.term_code,
l1.inq_type,
l1.leadsource
FROM lead_cte l1
LEFT JOIN lead_cte l2
ON l1.rownumber = l2.rownumber +1
AND l1.lastname = l2.lastname
AND l1.firstname = l2.firstname
AND l1.email = l2.email
AND l1.primary_phone = l2.primary_phone
AND l1.programx = l2.programx
AND (l1.term_code = l2.term_code
or ( l1.term_code is null and l2.term_code is null))
AND l1.inq_type = l2.inq_type
AND l1.leadsource = l2.leadsource
ORDER BY l1.rownumber
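
The effect of the nullable join column can be reproduced in isolation. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, a hypothetical table `lead_rows` with hand-assigned row numbers, and just two of the join columns. It runs the same self-join twice, once with a plain equality on term_code and once with the NULL-aware condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down stand-in for lead_cte: term_code is NULL everywhere,
# as in the sample data.
conn.executescript("""
    CREATE TABLE lead_rows (rownumber INTEGER, email TEXT, term_code TEXT);
    INSERT INTO lead_rows VALUES (1, 'test@test.com', NULL),
                                 (2, 'test@test.com', NULL),
                                 (3, 'test@test.com', NULL);
""")

QUERY = """
    SELECT l1.rownumber, l2.rownumber
    FROM lead_rows l1
    LEFT JOIN lead_rows l2
      ON l1.rownumber = l2.rownumber + 1
     AND l1.email = l2.email
     AND {term_code_match}
    ORDER BY l1.rownumber
"""

# Plain equality: NULL = NULL is not true, so no row ever matches.
plain = conn.execute(
    QUERY.format(term_code_match="l1.term_code = l2.term_code")
).fetchall()

# NULL-aware condition, same shape as the one in the query above.
null_safe = conn.execute(
    QUERY.format(
        term_code_match="(l1.term_code = l2.term_code"
        " OR (l1.term_code IS NULL AND l2.term_code IS NULL))"
    )
).fetchall()

print(plain)      # [(1, None), (2, None), (3, None)]
print(null_safe)  # [(1, None), (2, 1), (3, 2)]
```

With the plain equality every l2 value is missing, which matches the all-NULL output in the question. With the NULL-aware condition each row pairs with the previous row, and only the first row has no match. Any other join column that can be NULL would need the same treatment.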