LAG in CASE gives a false negative-offset error

Time: 2016-11-16 13:18:04

Tags: sql tsql sql-server-2012 window-functions

TL;DR: scroll down to Task 2.

I am working with the following data set:

email,createdby,createdon
a@b.c,jsmith,2016-10-10
a@b.c,nsmythe,2016-09-09
a@b.c,vstark,2016-11-11
b@x.y,ajohnson,2015-02-03
b@x.y,elear,2015-01-01
...

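And so on. Every email is guaranteed to have at least one duplicate in the data set.

There are two tasks to solve. I have solved one of them, but I am struggling with the other. I will walk through both.

Task 1 (solved): For each row, per email, return an additional column containing the name of the user who created the first record with that email.

Expected result for the sample data set above: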

email,createdby,createdon,original_createdby
a@b.c,jsmith,2016-10-10,nsmythe
a@b.c,nsmythe,2016-09-09,nsmythe
a@b.c,vstark,2016-11-11,nsmythe
b@x.y,ajohnson,2015-02-03,elear
b@x.y,elear,2015-01-01,elear

The code that produces the above:

;WITH   q0 -- this is just a security measure in case there are unique emails in the data set
          AS ( SELECT   t.email
               FROM     t
               GROUP BY t.email
               HAVING   COUNT(*) > 1) ,
        q1
          AS ( SELECT   q0.email
                      , createdon
                      , createdby
                      , ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
               FROM     t
               JOIN     q0
                        ON t.email = q0.email)
    SELECT  q1.email
          , q1.createdon
          , q1.createdby
          , LAG(q1.createdby, q1.rn - 1) OVER ( ORDER BY q1.email, q1.createdon ) original_createdby
    FROM    q1
    ORDER BY q1.email
          , q1.rn

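Brief explanation: I partition the data by email, number the rows within each partition in order of creation date, and finally return the [createdby] value from the record (rn - 1) rows back. It works exactly as expected.

Now, similar to the above, there is Task 2:

Task 2: For each row, per email, return the name of the user who created the first duplicate, i.e. the user name at rn = 2.

Expected result: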

email,createdby,createdon,first_dupl_createdby
a@b.c,jsmith,2016-10-10,jsmith
a@b.c,nsmythe,2016-09-09,jsmith
a@b.c,vstark,2016-11-11,jsmith
b@x.y,ajohnson,2015-02-03,ajohnson
b@x.y,elear,2015-01-01,ajohnson

I want to keep this performant, so I tried using the LEAD/LAG functions:

    WITH    q0
          AS ( SELECT   t.email
               FROM     t
               GROUP BY t.email
               HAVING   COUNT(*) > 1) ,
        q1
          AS ( SELECT   q0.email
                      , createdon
                      , createdby
                      , ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
               FROM     t
               JOIN     q0
                        ON t.email = q0.email)
    SELECT  q1.email
          , q1.createdon
          , q1.createdby
          , q1.rn
          , CASE q1.rn
              WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
              ELSE LAG(q1.createdby, q1.rn - 2) OVER ( ORDER BY q1.email, q1.createdon )
            END AS first_dupl_createdby
    FROM    q1
    ORDER BY q1.email
          , q1.rn

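Explanation: for the first record in each partition, return [createdby] from the following record (i.e. from the record containing the first duplicate). For all other records in the same partition, return [createdby] from (rn - 2) records back (i.e. for rn = 2 we stay on the same record, for rn = 3 we go back 1 record, for rn = 4 we go back 2 records, and so on).

A problem arises with the
ELSE LAG(q1.createdby, q1.rn - 2)

operation. Apparently, against all logic, even though the preceding branch (WHEN 1 THEN ...) exists, the ELSE branch is also evaluated for rn = 1, which results in a negative offset being passed to the LAG function: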

Msg 8730, Level 16, State 2, Line 37 Offset parameter for Lag and Lead functions cannot be a negative value.

When I comment out the ELSE line, the whole thing works, but obviously I then get no results in the first_dupl_createdby column for rn > 1.

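Question: Is there a way to rewrite the CASE statement above (in Task 2) so that it always returns the value from the record with rn = 2 in each partition, but (and this is important) without a self-JOIN? I know I could prepare the rn = 2 rows in a separate subquery, but that would mean an extra scan of the whole table and an unnecessary self-join.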
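One possible reading of the error, not confirmed anywhere in this thread, is that SQL Server computes the window functions for every row before the CASE expression picks a branch, so the offset q1.rn - 2 is evaluated for rn = 1 as well. Under that assumption, a minimal and untested sketch is to keep the question's query and clamp the offset so it can never go negative. It assumes the table t and the CTEs q0 and q1 exactly as defined in the question. The inner CASE used as the offset is the only change.

```sql
WITH    q0
          AS ( SELECT   t.email
               FROM     t
               GROUP BY t.email
               HAVING   COUNT(*) > 1) ,
        q1
          AS ( SELECT   q0.email
                      , createdon
                      , createdby
                      , ROW_NUMBER() OVER ( PARTITION BY q0.email ORDER BY createdon ) rn
               FROM     t
               JOIN     q0
                        ON t.email = q0.email)
    SELECT  q1.email
          , q1.createdon
          , q1.createdby
          , q1.rn
          , CASE q1.rn
              WHEN 1 THEN LEAD(q1.createdby, 1) OVER ( ORDER BY q1.email, q1.createdon )
              -- Clamp the offset to 0 for rn = 1. That value is discarded by the
              -- outer CASE anyway, but it keeps the LAG argument non-negative.
              ELSE LAG(q1.createdby,
                       CASE WHEN q1.rn >= 2 THEN q1.rn - 2 ELSE 0 END)
                   OVER ( ORDER BY q1.email, q1.createdon )
            END AS first_dupl_createdby
    FROM    q1
    ORDER BY q1.email
          , q1.rn
```

For rn = 1 the clamped LAG simply returns the current row's value, which the WHEN 1 branch then overrides, so the visible result should be the same as the one the original query was aiming for.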

3 answers:

Answer 0: (score: 1)

You can use row_number() and conditional aggregation to get the information for each email:
select email,
       max(case when seqnum = 1 then createdby end) as createdby_first,
       max(case when seqnum = 2 then createdby end) as createdby_second
from (select t.*,
             row_number() over (partition by email order by createdon) as seqnum
      from t
     ) t
group by email;

You can join this information back to the original data to get what you want. I don't see how lag() could naturally be used to solve this.
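The join-back step is not spelled out above. One way it might look is sketched below, assuming the table t from the question. The aliases s and agg and the output column names are chosen here for illustration only.

```sql
SELECT t.email,
       t.createdby,
       t.createdon,
       agg.createdby_first  AS original_createdby,
       agg.createdby_second AS first_dupl_createdby
FROM t
JOIN (
      -- One row per email: creator of the first and of the second record.
      SELECT s.email,
             MAX(CASE WHEN s.seqnum = 1 THEN s.createdby END) AS createdby_first,
             MAX(CASE WHEN s.seqnum = 2 THEN s.createdby END) AS createdby_second
      FROM (SELECT t.email,
                   t.createdby,
                   ROW_NUMBER() OVER (PARTITION BY t.email ORDER BY t.createdon) AS seqnum
            FROM t
           ) s
      GROUP BY s.email
     ) agg
  ON agg.email = t.email
ORDER BY t.email, t.createdon;
```

Note that this reads t twice and joins it to itself, which is the cost the question was hoping to avoid.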

Answer 1: (score: 1)

I think you can simply use the max window function, since you are trying to get the value at row number = 2 from each partition.

SELECT  q1.email
          , q1.createdon
          , q1.createdby
          , q1.rn
          , max(case when rn=2 then q1.createdby end) over(partition by q1.email) first_dup_created_by
FROM    q1
ORDER BY q1.email, q1.rn

You can also use a similar query to get the row number = 1 result for the first scenario.
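The query above relies on q1 from the question. A self-contained sketch that covers both tasks in one pass is shown below. It loads the question's sample rows into a table variable (@t is a stand-in introduced here, not part of the original post) and leaves out the q0 duplicate filter, since every email in the sample has a duplicate.

```sql
-- Sample rows from the question, held in a table variable for illustration.
DECLARE @t TABLE (email varchar(50), createdby varchar(50), createdon date);

INSERT INTO @t (email, createdby, createdon) VALUES
    ('a@b.c', 'jsmith',   '2016-10-10'),
    ('a@b.c', 'nsmythe',  '2016-09-09'),
    ('a@b.c', 'vstark',   '2016-11-11'),
    ('b@x.y', 'ajohnson', '2015-02-03'),
    ('b@x.y', 'elear',    '2015-01-01');

WITH q1 AS (
    SELECT email,
           createdon,
           createdby,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY createdon) AS rn
    FROM   @t
)
SELECT q1.email,
       q1.createdby,
       q1.createdon,
       -- Task 1: creator of the earliest record for this email.
       MAX(CASE WHEN q1.rn = 1 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS original_createdby,
       -- Task 2: creator of the first duplicate (second record) for this email.
       MAX(CASE WHEN q1.rn = 2 THEN q1.createdby END)
           OVER (PARTITION BY q1.email) AS first_dupl_createdby
FROM   q1
ORDER BY q1.email, q1.rn;
```

On the sample data this should give nsmythe / jsmith for a@b.c and elear / ajohnson for b@x.y, matching the expected results in the question, with no join back to the table.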

Answer 2: (score: 0)

/shrug

; WITH duplicate_email_addresses AS (
  SELECT email
  FROM   t
  GROUP
      BY email
  HAVING Count(*) > 1
)
, records_with_duplicate_email_addresses AS (
  SELECT email
       , createdon
       , createdby
       , Row_Number() OVER (PARTITION BY email ORDER BY createdon) AS sequencer
  FROM   t
  WHERE  EXISTS (
           SELECT *
           FROM   duplicate_email_addresses
           WHERE  email = t.email
         )
)
, second_duplicate_record AS ( -- Why do you need any more than this?
  SELECT email
       , createdon
       , createdby
  FROM   records_with_duplicate_email_addresses
  WHERE  sequencer = 2
)
SELECT records_with_duplicate_email_addresses.email
     , records_with_duplicate_email_addresses.createdon
     , records_with_duplicate_email_addresses.createdby
     , second_duplicate_record.createdby AS first_duplicate_createdby
FROM   records_with_duplicate_email_addresses
 INNER
  JOIN second_duplicate_record
    ON second_duplicate_record.email = records_with_duplicate_email_addresses.email
;