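I have a dataset like the one shown below:
For each OwnerID, I want to compute the difference between the creationtime of the current record and that of the next record (for the same ownerID), and store it in a new column, TimeDiff. I believe a self join is needed here, but I am not sure how to use one to compute the difference between the current record and the next.
When doing this, the last record of any ownerID should default to 'NA', because there is no next record (for the same ownerID) to compute the difference against.
This is the query I used to get this dataset: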
SELECT DISTINCT ga.ownerid,
mr.name,
SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
EXTRACT(DAY FROM ga.creationtime) AS DAY,
EXTRACT(DOW FROM ga.creationtime) AS DOW,
ga.creationtime,
a.encodedid,
a.name
FROM flx2.groupactivities ga
JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
JOIN flx2.memberroles mr ON mr.id = mhr.roleid
WHERE ga.activitytype = 'assign'
AND ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
AND a.artifacttypeid = 54
AND a.encodedid IS NOT NULL
ORDER BY ga.ownerid,
ga.creationtime,
a.encodedid
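I am using Amazon Redshift to get this data.
Any help would be greatly appreciated.
TIA!
Update
I used the approach suggested by @systemjack. Here are the results I got:
We can clearly see here that the encodedid column repeats for an assignmentID (MAT.PRB.410, as highlighted in the image above), which should not be the case. With the query mentioned above, without the LEAD function, this does not happen. This is the updated query I am using (the only addition is the LEAD function):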
SELECT DISTINCT ga.ownerid,
mr.name,
SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
EXTRACT(DAY FROM ga.creationtime) AS DAY,
EXTRACT(DOW FROM ga.creationtime) AS DOW,
ga.creationtime,
LEAD(ga.creationtime,1) OVER (PARTITION BY ga.ownerid ORDER BY ga.creationtime) AS nexttime,
a.encodedid,
a.name
FROM flx2.groupactivities ga
JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
JOIN flx2.memberroles mr ON mr.id = mhr.roleid
WHERE ga.activitytype = 'assign'
AND ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
AND a.artifacttypeid = 54
AND a.encodedid IS NOT NULL
ORDER BY ga.ownerid,
ga.creationtime,
a.encodedid LIMIT 1000
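The values in the nexttime column also appear to be wrong. It seems to take the next value from the creationtime column only on occasion. For example, in the second record the value of nexttime should be 2013-09-18 06:14:59 rather than 2014-01-18 12:16:49.
Why are we getting more records than expected? How can I fix these issues?
Answer 0 (score: 2)
Update: does this look any better?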
with dataset as (
SELECT DISTINCT ga.ownerid,
mr.name,
SPLIT_PART(SPLIT_PART(ga.activitydata,' ',2),',',1) AS Assignmentid,
EXTRACT(YEAR FROM ga.creationtime) AS YEAR,
EXTRACT(MONTH FROM ga.creationtime) AS MONTH,
EXTRACT(DAY FROM ga.creationtime) AS DAY,
EXTRACT(DOW FROM ga.creationtime) AS DOW,
ga.creationtime,
a.encodedid,
a.name
FROM flx2.groupactivities ga
JOIN flx2.memberstudytrackitemstatus mstis ON SPLIT_PART (SPLIT_PART (ga.activitydata,' ',2),',',1) = mstis.assignmentid
JOIN flx2.artifacts a ON mstis.studytrackitemid = a.id
JOIN auth.memberhasroles mhr ON mhr.memberid = ga.ownerid
JOIN flx2.memberroles mr ON mr.id = mhr.roleid
WHERE ga.activitytype = 'assign'
AND ga.ownerid NOT IN (SELECT memberid FROM auth.memberhasroles WHERE roleid = 25)
AND a.artifacttypeid = 54
AND a.encodedid IS NOT NULL
)
select d.*,
LEAD(creationtime,1) OVER (PARTITION BY ownerid ORDER BY creationtime) AS nexttime
from dataset d
ORDER BY ownerid, creationtime, encodedid, nexttime
LIMIT 1000
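One plausible reason the extra rows appear when LEAD sits in the same SELECT as DISTINCT: window functions are evaluated before DISTINCT, so rows that were exact duplicates can receive different nexttime values and then no longer collapse. The sketch below reproduces that effect on a tiny table. It uses Python's sqlite3 module (SQLite 3.25 or later) purely as a stand-in for Redshift. The activity table, its rows, and every encodedid except MAT.PRB.410 are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for the joined result set; SQLite replaces Redshift here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (ownerid INTEGER, creationtime TEXT, encodedid TEXT)"
)
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [
        (1, "2013-09-18 06:00:00", "MAT.PRB.410"),
        (1, "2013-09-18 06:00:00", "MAT.PRB.410"),  # duplicate, as a join fan-out would produce
        (1, "2013-09-18 06:14:59", "MAT.PRB.420"),
        (1, "2014-01-18 12:16:49", "MAT.PRB.430"),
    ],
)

# LEAD is computed over all four rows before DISTINCT is applied, so the two
# duplicates end up with different nexttime values and both are kept.
naive_rows = conn.execute("""
    SELECT DISTINCT ownerid, creationtime, encodedid,
           LEAD(creationtime, 1) OVER (PARTITION BY ownerid ORDER BY creationtime) AS nexttime
    FROM activity
""").fetchall()

# De-duplicate in a CTE first, then apply LEAD to the already-distinct rows.
fixed_rows = conn.execute("""
    WITH dataset AS (
        SELECT DISTINCT ownerid, creationtime, encodedid FROM activity
    )
    SELECT d.*,
           LEAD(creationtime, 1) OVER (PARTITION BY ownerid ORDER BY creationtime) AS nexttime
    FROM dataset d
    ORDER BY ownerid, creationtime
""").fetchall()

print(len(naive_rows), len(fixed_rows))
```

If several distinct rows share the same ownerid and creationtime (for example one assignment with several encodedid values), LEAD will still return that same creationtime for some of them, so the partitioning or ordering columns may need to be adjusted for the real data.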
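Something like this (untested code) might work. The idea is to use the LEAD window function to fetch, for each owner, the creationtime of the following record, or NULL if it is the last record, and then use the regular DATEDIFF function to get the difference in the units you want. The CASE expression around DATEDIFF in the outer query handles the last-record edge case, and you can adjust it to get the result you want.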
select ownerid, creationtime,
-- last record per owner has no next row, so nexttime is null there
case when nexttime is not null
then datediff('second', creationtime, nexttime)
else datediff('second', creationtime, sysdate)
end as timediff
from (
select distinct ownerid, creationtime,
lead(creationtime,1) over (partition by ownerid order by creationtime) as nexttime
from yourdata
) t
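To see the per-owner difference with the last record left empty, as the question asks, here is a minimal runnable sketch. It again uses Python's sqlite3 module (SQLite 3.25 or later) in place of Redshift. SQLite has no DATEDIFF, so the difference in seconds is computed from strftime('%s', ...) instead. The rows in yourdata and the report list are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourdata (ownerid INTEGER, creationtime TEXT)")
conn.executemany(
    "INSERT INTO yourdata VALUES (?, ?)",
    [
        (1, "2013-09-18 06:00:00"),
        (1, "2013-09-18 06:14:59"),
        (2, "2014-01-18 12:16:49"),
    ],
)

# strftime('%s', ...) yields Unix seconds as text; casting and subtracting
# stands in for Redshift's DATEDIFF('second', creationtime, nexttime).
# When nexttime is NULL (last record per owner) the whole expression is NULL.
timediff_rows = conn.execute("""
    SELECT ownerid, creationtime,
           CAST(strftime('%s', nexttime) AS INTEGER)
             - CAST(strftime('%s', creationtime) AS INTEGER) AS timediff
    FROM (
        SELECT ownerid, creationtime,
               LEAD(creationtime, 1) OVER (PARTITION BY ownerid ORDER BY creationtime) AS nexttime
        FROM yourdata
    ) t
    ORDER BY ownerid, creationtime
""").fetchall()

# Map SQL NULL to the 'NA' marker the question asks for, on the client side.
report = [(o, c, "NA" if d is None else d) for o, c, d in timediff_rows]
print(report)
```

Keeping timediff as NULL in SQL and substituting 'NA' only for display avoids mixing integers and strings in one column.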
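Answer 1 (score: 1)
Personally, I don't think there is a declarative (pure SQL) way to achieve this. Sorry. You cannot refer to the values of a specific record within a set, whether it is the next or the previous one; that is inherent to how sets work.
So I can see three ways here:
1) Use procedural extensions to SQL (MySQL has them too).
2) Fetch the whole set and process it externally, on the "client" side (relative to the RDBMS).
3) Add a timediff column to the table plus an AFTER INSERT / UPDATE trigger in which you compute the difference and update the record.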