TL;DR: scroll down to TASK 2.
I am working with the following data set:
email,createdby,createdon
a@b.c,jsmith,2016-10-10
a@b.c,nsmythe,2016-09-09
a@b.c,vstark,2016-11-11
b@x.y,ajohnson,2015-02-03
b@x.y,elear,2015-01-01
...
and so on. Every email is guaranteed to have at least one duplicate in the data set.
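For anyone who wants to reproduce this, here is a minimal setup sketch. The table name t is the one used in the queries in this post; the column types are assumptions, since only the CSV form of the data is shown.

```sql
-- Hypothetical setup: the column types are assumptions, only the CSV sample is given.
CREATE TABLE t (
    email     varchar(255) NOT NULL,
    createdby varchar(50)  NOT NULL,
    createdon date         NOT NULL
);

-- The five sample rows listed above.
INSERT INTO t (email, createdby, createdon)
VALUES ('a@b.c', 'jsmith',   '2016-10-10'),
       ('a@b.c', 'nsmythe',  '2016-09-09'),
       ('a@b.c', 'vstark',   '2016-11-11'),
       ('b@x.y', 'ajohnson', '2015-02-03'),
       ('b@x.y', 'elear',    '2015-01-01');
```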
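Now, there are two tasks to solve. I have solved one of them, but I am struggling with the other. I will walk through both.
TASK 1 (solved): For every row, for each email, return an additional column containing the name of the user who created the first record with this email.
Expected result for the sample data set above: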
email,createdby,createdon,original_createdby
a@b.c,jsmith,2016-10-10,nsmythe
a@b.c,nsmythe,2016-09-09,nsmythe
a@b.c,vstark,2016-11-11,nsmythe
b@x.y,ajohnson,2015-02-03,elear
b@x.y,elear,2015-01-01,elear
Code that produces the above:
;WITH q0 -- this is just a security measure in case there are unique emails in the data set
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby
FROM q1
ORDER BY q1.email
, q1.rn
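Brief explanation: I partition the data by email, number the rows in each partition ordered by creation date, and finally return the [createdby] value from the record (rn - 1) rows back. This works exactly as expected.
Now, similar to the above, there is TASK 2:
TASK 2: For every row, for each email, return the name of the user who created the first duplicate, i.e. the user name of the record with rn = 2.
Expected result: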
email,createdby,createdon,first_dupl_createdby
a@b.c,jsmith,2016-10-10,jsmith
a@b.c,nsmythe,2016-09-09,jsmith
a@b.c,vstark,2016-11-11,jsmith
b@x.y,ajohnson,2015-02-03,ajohnson
b@x.y,elear,2015-01-01,ajohnson
I would like to keep performance high, so I tried to use the LEAD/LAG functions:
WITH q0
AS ( SELECT t.email
FROM t
GROUP BY t.email
HAVING COUNT(*) > 1) ,
q1
AS ( SELECT q0.email
, createdon
, createdby
, ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
FROM t
JOIN q0
ON t.email = q0.email)
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, CASE q1.rn
WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )
END AS first_dupl_createdby
FROM q1
ORDER BY q1.email
, q1.rn
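Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record that contains the first duplicate). For all other records in the same partition, return [createdby] from (rn - 2) records back (i.e. for rn = 2 we stay on the same record, for rn = 3 we go back 1 record, for rn = 4 we go back 2 records, and so on).
A problem comes up with the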
ELSE LAG(q1.createdby, q1.rn - 2)
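operation. Apparently, against all logic, the ELSE block is also evaluated for rn = 1 even though the preceding branch (WHEN 1 THEN ...) covers that case, so a negative offset is passed to the LAG function:
Msg 8730, Level 16, State 2, Line 37 Offset parameter for Lag and Lead functions cannot be a negative value.
When I comment out the ELSE line, the whole thing works fine, but obviously I get no results in the first_dupl_createdby column for rn > 1.
QUESTION: Is there a way to rewrite the CASE statement above (in TASK 2) so that it always returns the value from the record with rn = 2 in each partition, but (and this is important) without doing a self-JOIN? I know I could prepare the rn = 2 rows in a separate subquery, but that would mean an extra scan of the whole table and an unnecessary self-join.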
Answer 0: (score: 1)
You can use row_number() and conditional aggregation:
select email,
max(case when seqnum = 1 then createdby end) as createdby_first,
max(case when seqnum = 2 then createdby end) as createdby_second
from (select t.*,
row_number() over (partition by email order by createdon) as seqnum
from t
) t
group by email;
You can join this back to the original data to get the information you want. I don't see how lag() would naturally be used to solve this.
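As a rough sketch, joining the aggregate back to the detail rows could look like the following. The aliases numbered and agg are made up for illustration, and the output column name first_dupl_createdby is taken from the question. Note that this reads t twice, which is the self-join the question was hoping to avoid.

```sql
-- Sketch: join the per-email aggregate back to the detail rows.
-- The aliases "numbered" and "agg" are illustrative only.
SELECT t.email,
       t.createdby,
       t.createdon,
       agg.createdby_second AS first_dupl_createdby
FROM t
JOIN (SELECT email,
             MAX(CASE WHEN seqnum = 1 THEN createdby END) AS createdby_first,
             MAX(CASE WHEN seqnum = 2 THEN createdby END) AS createdby_second
      FROM (SELECT t.*,
                   ROW_NUMBER() OVER (PARTITION BY email ORDER BY createdon) AS seqnum
            FROM t
           ) numbered
      GROUP BY email
     ) agg
  ON agg.email = t.email
ORDER BY t.email, t.createdon;
```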
Answer 1: (score: 1)
I think you can simply use the max window function, since you are trying to get the value where the row number = 2 from each partition.
SELECT q1.email
, q1.createdon
, q1.createdby
, q1.rn
, max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM q1
ORDER BY q1.email, q1.rn
You can also use a similar query to get the row number = 1 result for the first scenario.
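Put together, a self-contained sketch covering both tasks might look like this. It assumes the table t and the column names from the question, and it leaves out the q0 filter on the assumption that every email has at least one duplicate. If an email occurred only once, first_dupl_createdby would come back as NULL for it.

```sql
-- Sketch: number the rows per email once, then pick rn = 1 and rn = 2
-- with a windowed MAX. No self-join and no LAG/LEAD offsets are involved.
WITH q1 AS (
    SELECT email,
           createdon,
           createdby,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY createdon) AS rn
    FROM t
)
SELECT q1.email,
       q1.createdon,
       q1.createdby,
       MAX(CASE WHEN q1.rn = 1 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS original_createdby,   -- TASK 1
       MAX(CASE WHEN q1.rn = 2 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS first_dupl_createdby  -- TASK 2
FROM q1
ORDER BY q1.email, q1.rn;
```

The CASE expression yields NULL for every row except the one being picked, and MAX ignores NULLs, so each partition collapses to the single createdby value of that row.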
Answer 2: (score: 0)
/shrug
; WITH duplicate_email_addresses AS (
SELECT email
FROM t
GROUP
BY email
HAVING Count(*) > 1
)
, records_with_duplicate_email_addresses AS (
SELECT email
, createdon
, createdby
, Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer
FROM t
WHERE EXISTS (
SELECT *
FROM duplicate_email_addresses
WHERE email = t.email
)
)
, second_duplicate_record AS ( -- Why do you need any more than this?
SELECT email
, createdon
, createdby
FROM records_with_duplicate_email_addresses
WHERE sequencer = 2
)
SELECT records_with_duplicate_email_addresses.email
, records_with_duplicate_email_addresses.createdon
, records_with_duplicate_email_addresses.createdby
, second_duplicate_record.createdby AS first_duplicate_createdby
FROM records_with_duplicate_email_addresses
INNER
JOIN second_duplicate_record
ON second_duplicate_record.email = records_with_duplicate_email_addresses.email
;