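I have a database containing events for task attempts and their outcomes (fail or success). For each user, I want to count the number of failures before the first success. Subsequent failures and successes should not affect the output: I am only interested in the first success for a given task. In addition, the DB contains rows with other events, which should be ignored.
How can I formulate this in T-SQL for a Vertica database?
(I ultimately want to compute the average number of attempts per task, but let's leave that out of the scope of this question to keep it manageable.)
This is an update to the question asked here: T-SQL: Count number of failures until first success
In the original question I gave poorly constructed sample data that did not fully reflect my use case, so the resulting answers did not work on my real data and I couldn't verify them.
The solution should not depend on row order: the rows may not have been inserted in timestamp order.
Here is the database setup: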
CREATE TABLE events (
    eventID int -- unused in this example, should be excluded from output
  , eventName varchar(256)
  , userName varchar(256)
  , timestamp timestamp
  , taskName varchar(256)
  , sessionID int -- unused in this example, should be excluded from output
);
INSERT INTO events
VALUES
(2363460186192576512, 'beginSession', 'John', '2017-08-14 09:46:46.712', NULL, 145031357)
, (2363460852537008128, 'success', 'John', '2017-08-14 09:49:32.471', 'TaskOne', 145031357)
, (2363461162974437376, 'success', 'John', '2017-08-14 09:50:48.781', 'TaskOne', 145031357)
, (2363460390131740672, 'fail', 'John', '2017-08-14 09:47:37.349', 'TaskOne', 145031357)
, (2363460556662710272, 'fail', 'John', '2017-08-14 09:48:23.024', 'TaskOne', 145031357)
, (2363460730671505408, 'fail', 'John', '2017-08-14 09:48:58.646', 'TaskOne', 145031357)
, (2363461032111800320, 'fail', 'John', '2017-08-14 09:50:10.726', 'TaskOne', 145031357)
, (2363460389896859648, 'beginTask', 'John', '2017-08-14 09:47:05.32', 'TaskOne', 145031357)
, (2363460463137751040, 'beginTask', 'John', '2017-08-14 09:47:52.166', 'TaskOne', 145031357)
, (2363460556205531136, 'beginTask', 'John', '2017-08-14 09:48:12.615', 'TaskOne', 145031357)
, (2363460692671205376, 'beginTask', 'John', '2017-08-14 09:48:36.155', 'TaskOne', 145031357)
, (2363460852268572672, 'beginTask', 'John', '2017-08-14 09:49:12.047', 'TaskOne', 145031357)
, (2363460962524327936, 'beginTask', 'John', '2017-08-14 09:49:47.951', 'TaskOne', 145031357)
, (2363461162714390528, 'beginTask', 'John', '2017-08-14 09:50:23.645', 'TaskOne', 145031357)
, (2363474741421064192, 'beginSession', 'John', '2017-08-14 10:44:36.042', NULL, 145031392)
, (2363474885491200000, 'success', 'John', '2017-08-14 10:45:14.577', 'TaskTwo', 145031392)
, (2363475342389641216, 'success', 'John', '2017-08-14 10:47:04.098', 'TaskTwo', 145031392)
, (2363475473998635008, 'success', 'John', '2017-08-14 10:47:34.135', 'TaskOne', 145031392)
, (2363475822079254528, 'success', 'John', '2017-08-14 10:48:53.381', 'TaskTwo', 145031392)
, (2363476096949104640, 'success', 'John', '2017-08-14 10:50:07.441', 'TaskThree', 145031392)
, (2363475066098266112, 'fail', 'John', '2017-08-14 10:45:53.526', 'TaskTwo', 145031392)
, (2363475195152531456, 'fail', 'John', '2017-08-14 10:46:32.81', 'TaskTwo', 145031392)
, (2363475654638821376, 'fail', 'John', '2017-08-14 10:48:13.71', 'TaskThree', 145031392)
, (2363476247751114752, 'beginSession', 'Mike', '2017-08-14 10:50:37.67', NULL, 145030476)
, (2363476335819063296, 'success', 'Mike', '2017-08-14 10:51:06.841', 'TaskOne', 145030476)
, (2363476485643796480, 'success', 'Mike', '2017-08-14 10:51:41.086', 'TaskTwo', 145030476)
, (2363476806063038464, 'success', 'Mike', '2017-08-14 10:52:53.174', 'TaskTwo', 145030476)
, (2363477266119335936, 'success', 'Mike', '2017-08-14 10:54:32.053', 'TaskThree', 145030476)
, (2363477619191631872, 'success', 'Mike', '2017-08-14 10:56:01.783', 'TaskThree', 145030476)
, (2363476705131655168, 'fail', 'Mike', '2017-08-14 10:52:21.312', 'TaskThree', 145030476)
, (2363476939634896896, 'fail', 'Mike', '2017-08-14 10:53:28.906', 'TaskThree', 145030476)
, (2363477390937976832, 'fail', 'Mike', '2017-08-14 10:55:05.499', 'TaskThree', 145030476)
, (2363476335592570880, 'beginTask', 'Mike', '2017-08-14 10:50:50.074', 'TaskOne', 145030476)
, (2363476485501190144, 'beginTask', 'Mike', '2017-08-14 10:51:20.784', 'TaskTwo', 145030476)
, (2363476704779333632, 'beginTask', 'Mike', '2017-08-14 10:51:54.829', 'TaskThree', 145030476)
, (2363476805752659968, 'beginTask', 'Mike', '2017-08-14 10:52:34.001', 'TaskTwo', 145030476)
, (2363476939496484864, 'beginTask', 'Mike', '2017-08-14 10:53:06.468', 'TaskThree', 145030476)
, (2363477265938980864, 'beginTask', 'Mike', '2017-08-14 10:53:45.631', 'TaskThree', 145030476)
, (2363477390635986944, 'beginTask', 'Mike', '2017-08-14 10:54:44.706', 'TaskThree', 145030476)
, (2363477573427560448, 'beginTask', 'Mike', '2017-08-14 10:55:17.231', 'TaskThree', 145030476)
, (2363474885214375936, 'beginTask', 'John', '2017-08-14 10:44:44.702', 'TaskTwo', 145031392)
, (2363474985177161728, 'beginTask', 'John', '2017-08-14 10:45:31.133', 'TaskTwo', 145031392)
, (2363475195014119424, 'beginTask', 'John', '2017-08-14 10:46:10.098', 'TaskTwo', 145031392)
, (2363475342184120320, 'beginTask', 'John', '2017-08-14 10:46:45.357', 'TaskTwo', 145031392)
, (2363475473616953344, 'beginTask', 'John', '2017-08-14 10:47:17.911', 'TaskOne', 145031392)
, (2363475654437494784, 'beginTask', 'John', '2017-08-14 10:47:47.681', 'TaskThree', 145031392)
, (2363475771776864256, 'beginTask', 'John', '2017-08-14 10:48:27.1', 'TaskTwo', 145031392)
, (2363476006456762368, 'beginTask', 'John', '2017-08-14 10:49:06.151', 'TaskThree', 145031392)
;
Given this data, this is the result I am trying to achieve:
userName taskName numFailuresBeforeFirstSuccess
John TaskOne 3
John TaskTwo 0
John TaskThree 1
Mike TaskOne 0
Mike TaskTwo 0
Mike TaskThree 3
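As a cross-check of the requirement, here is a minimal Python sketch (not part of the original question) that computes the count in two passes over unordered rows: first find the earliest success per (userName, taskName), then count the fails that precede it. The `EVENTS` list is a subset of the sample rows above, and the function name `failures_before_first_success` is a hypothetical name chosen for illustration.

```python
from datetime import datetime

# (eventName, userName, timestamp, taskName): a subset of the sample rows,
# deliberately listed out of timestamp order.
EVENTS = [
    ("success", "John", "2017-08-14 09:49:32.471", "TaskOne"),
    ("success", "John", "2017-08-14 09:50:48.781", "TaskOne"),
    ("fail", "John", "2017-08-14 09:47:37.349", "TaskOne"),
    ("fail", "John", "2017-08-14 09:48:23.024", "TaskOne"),
    ("fail", "John", "2017-08-14 09:48:58.646", "TaskOne"),
    ("fail", "John", "2017-08-14 09:50:10.726", "TaskOne"),
    ("beginTask", "John", "2017-08-14 09:47:05.32", "TaskOne"),
    ("beginSession", "Mike", "2017-08-14 10:50:37.67", None),
    ("success", "Mike", "2017-08-14 10:54:32.053", "TaskThree"),
    ("success", "Mike", "2017-08-14 10:56:01.783", "TaskThree"),
    ("fail", "Mike", "2017-08-14 10:52:21.312", "TaskThree"),
    ("fail", "Mike", "2017-08-14 10:53:28.906", "TaskThree"),
    ("fail", "Mike", "2017-08-14 10:55:05.499", "TaskThree"),
]


def parse(ts):
    # %f accepts 1 to 6 fractional digits, so ".32" and ".471" both parse.
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")


def failures_before_first_success(events):
    # Pass 1: earliest success per (user, task), independent of row order.
    first_success = {}
    for name, user, ts, task in events:
        if name == "success":
            key = (user, task)
            when = parse(ts)
            if key not in first_success or when < first_success[key]:
                first_success[key] = when
    # Pass 2: count fails strictly before that success; other events are ignored.
    counts = dict.fromkeys(first_success, 0)
    for name, user, ts, task in events:
        key = (user, task)
        if name == "fail" and key in first_success and parse(ts) < first_success[key]:
            counts[key] += 1
    return counts


result = failures_before_first_success(EVENTS)
for (user, task), n in sorted(result.items()):
    print(user, task, n)
```

On these rows John/TaskOne comes out as 3. Mike/TaskThree comes out as 2 rather than the 3 listed, because his third fail (10:55:05.499) falls after his first success (10:54:32.053).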
Answer 0 (score: 1)
Here is one approach:
select e.username, e.taskname,
       sum(case when timestamp < first_success_ts and e.eventname = 'fail' then 1 else 0 end) as numFailuresBeforeSuccess
from (select e.*,
             min(case when e.eventname = 'success' then e.timestamp end) over
                 (partition by e.username, e.taskname) as first_success_ts
      from events e
     ) e
group by e.username, e.taskname
order by e.username, e.taskname;
This uses a window function to compute the time of the first success. It should work in both databases (in SQL Server, at least from 2012 onward).
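To try the window-function approach without a Vertica or SQL Server instance, the sketch below runs essentially the same query through Python's built-in `sqlite3` module. This is only a stand-in under stated assumptions: SQLite 3.25 or later (needed for window functions), timestamps stored as fixed-width text so that string comparison matches time order, and a reduced table with a handful of the sample rows. The `where e.taskname is not null` filter is an addition, not part of the answer; without it, the task-less `beginSession` rows would produce an extra group with a NULL task name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eventName TEXT, userName TEXT, timestamp TEXT, taskName TEXT)"
)
# Subset of the sample rows, inserted out of timestamp order.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("beginSession", "John", "2017-08-14 09:46:46.712", None),
        ("success", "John", "2017-08-14 09:49:32.471", "TaskOne"),
        ("success", "John", "2017-08-14 09:50:48.781", "TaskOne"),
        ("fail", "John", "2017-08-14 09:47:37.349", "TaskOne"),
        ("fail", "John", "2017-08-14 09:48:23.024", "TaskOne"),
        ("fail", "John", "2017-08-14 09:48:58.646", "TaskOne"),
        ("fail", "John", "2017-08-14 09:50:10.726", "TaskOne"),
        ("success", "Mike", "2017-08-14 10:54:32.053", "TaskThree"),
        ("success", "Mike", "2017-08-14 10:56:01.783", "TaskThree"),
        ("fail", "Mike", "2017-08-14 10:52:21.312", "TaskThree"),
        ("fail", "Mike", "2017-08-14 10:53:28.906", "TaskThree"),
        ("fail", "Mike", "2017-08-14 10:55:05.499", "TaskThree"),
    ],
)

query = """
select e.username, e.taskname,
       sum(case when e.timestamp < e.first_success_ts and e.eventname = 'fail'
                then 1 else 0 end) as numFailuresBeforeSuccess
from (select e.*,
             min(case when e.eventname = 'success' then e.timestamp end)
                 over (partition by e.username, e.taskname) as first_success_ts
      from events e
      where e.taskname is not null  -- assumption: drop task-less rows
     ) e
group by e.username, e.taskname
order by e.username, e.taskname
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that a (user, task) pair with fails but no success would still appear here with a count of 0, because the comparison against a NULL `first_success_ts` is never true.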
Answer 1 (score: 1)
Again, this is T-SQL rather than Vertica, but as long as Vertica supports CTEs, it is fairly standard SQL.
; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts
        , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS
        , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
    FROM #taskevents t1
    WHERE t1.eventName = 'Success'
)
SELECT s1.userName, s1.taskName, AVG(s1.failCount) AS avgFailCount
FROM (
    SELECT cte1.userName, cte1.taskName, cte1.rn, COALESCE(COUNT(t2.ts),0) AS failCount
    FROM cte1
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName
        AND t2.taskName = cte1.taskName
        AND t2.ts < cte1.ts
        AND ( t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL )
        AND t2.eventName = 'fail'
    GROUP BY cte1.userName, cte1.taskName, cte1.rn
) s1
GROUP BY s1.userName, s1.taskName
ORDER BY s1.userName, s1.taskName
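This gives you the average. Remove the outer query to see the underlying data I am working with. It produces results slightly different from the ones you listed, but it should give the correct average as you described it. Let me know if I have misunderstood the requirement.
Note: in my test data I also added two people who have fails but no success, just to verify that they are excluded from the results.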
, (2363476006456762398, 'fail', 'Steve', '2017-08-14 11:29:06.151', 'Task42', 145031342)
, (2363476046456762368, 'fail', 'Joe', '2017-08-14 11:49:06.151', 'Task42', 145031399)
=====================================
Results
-----------------------------------
|userName| taskName |avgFailCount|
-----------------------------------
| John | TaskOne | 1 |
| John | TaskThree | 1 |
| John | TaskTwo | 0 |
| Mike | TaskOne | 0 |
| Mike | TaskThree | 1 |
| Mike | TaskTwo | 0 |
-----------------------------------
========================================================================
Edit: to get the average by task only:
; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts
        , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS
        , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
    FROM #taskevents t1
    WHERE t1.eventName = 'Success'
)
SELECT s1.taskName
    , AVG(CAST(s1.failCount AS decimal(5,2))) AS avgFailCount
FROM (
    SELECT cte1.userName, cte1.taskName, cte1.rn, COALESCE(COUNT(t2.ts),0) AS failCount
    FROM cte1
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName
        AND t2.taskName = cte1.taskName
        AND t2.ts < cte1.ts
        AND ( t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL )
        AND t2.eventName = 'fail'
    GROUP BY cte1.userName, cte1.taskName, cte1.rn
) s1
GROUP BY s1.taskName
ORDER BY s1.taskName
Which gives you
--------------------------
| taskName |avgFailCount|
--------------------------
| TaskOne | 1.000000 |
| TaskThree | 1.333333 |
| TaskTwo | 0.400000 |
--------------------------
This is essentially
SELECT (3+1+0+0)/4.0 AS TaskOne
SELECT (0+2+0+0+0)/5.0 AS TaskTwo
SELECT (1+2+1)/3.0 AS TaskThree
computed from the following data points.
--------------------------------
|userName| taskName |FailCount|
--------------------------------
| John | TaskOne | 3 |
| John | TaskOne | 1 |
| John | TaskOne | 0 |
| Mike | TaskOne | 0 |
| John | TaskTwo | 0 |
| John | TaskTwo | 2 |
| John | TaskTwo | 0 |
| Mike | TaskTwo | 0 |
| Mike | TaskTwo | 0 |
| John | TaskThree | 1 |
| Mike | TaskThree | 2 |
| Mike | TaskThree | 1 |
--------------------------------
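This is the average number of failures before a success, not the average number of failures per attempt. That would be slightly different.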
---------------------------------------------------
| task | fails | attempts | avg fails per attempt |
---------------------------------------------------
| Task1| 4 | 8 | 4/8 = 0.500000 |
| Task2| 2 | 7 | 2/7 = 0.285714 |
| Task3| 4 | 7 | 4/7 = 0.571429 |
---------------------------------------------------
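The difference between the two averages can be reproduced from the per-success data points listed earlier. The following Python sketch is illustrative only and assumes that every attempt ends in exactly one `fail` or `success` event, so that attempts = fails + successes; the names `fail_counts` and `summary` are hypothetical.

```python
# Fails counted before each success, per task, copied from the data points above.
fail_counts = {
    "TaskOne": [3, 1, 0, 0],
    "TaskTwo": [0, 2, 0, 0, 0],
    "TaskThree": [1, 2, 1],
}

summary = {}
for task, counts in fail_counts.items():
    fails = sum(counts)
    successes = len(counts)  # one data point per success
    avg_fails_before_success = fails / successes
    fails_per_attempt = fails / (fails + successes)  # assumption: attempts = fails + successes
    summary[task] = (avg_fails_before_success, fails_per_attempt)

for task, (before, per_attempt) in summary.items():
    print(f"{task}: {before:.6f} fails before a success, {per_attempt:.6f} fails per attempt")
```

For TaskThree the data points sum to 1 + 2 + 1 = 4 fails across 3 successes, so under this assumption the per-attempt figure is 4/7.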
Answer 2 (score: 0)
This query:
with F as
(
    select * from Evts where eventName = 'fail'
),
S as
(
    select * from Evts E
    cross apply
    (
        select count(F.eventID) numFailuresBeforeFirstSuccess from F
        where F.userName = E.userName and
              E.taskName = F.taskName and
              F.timestamp < E.timestamp
    ) K
    where eventName = 'success'
)
select userName, taskName, numFailuresBeforeFirstSuccess from
(select *, row_number() over (partition by userName, taskName order by [timestamp] desc) o from S ) S
where o = 1
produces this result:
userName taskName numFailuresBeforeFirstSuccess
----------- ----------- -----------------------------
John TaskOne 4
John TaskThree 1
John TaskTwo 2
Mike TaskOne 0
Mike TaskThree 3
Mike TaskTwo 0
The previous explanation applies here.
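Because `row_number()` is ordered by `[timestamp] desc`, the row with `o = 1` is the latest success for each user and task, so the counts above are failures before the last success. The small Python sketch below (an illustration, not the answerer's code) contrasts the two orderings on John's TaskOne rows from the sample data; `fails_before` is a hypothetical helper.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# John's TaskOne events from the sample data.
successes = [
    datetime.strptime(ts, FMT)
    for ts in ("2017-08-14 09:49:32.471", "2017-08-14 09:50:48.781", "2017-08-14 10:47:34.135")
]
fails = [
    datetime.strptime(ts, FMT)
    for ts in (
        "2017-08-14 09:47:37.349",
        "2017-08-14 09:48:23.024",
        "2017-08-14 09:48:58.646",
        "2017-08-14 09:50:10.726",
    )
]


def fails_before(cutoff):
    # Number of fails strictly earlier than the given success time.
    return sum(1 for f in fails if f < cutoff)


before_first = fails_before(min(successes))  # ascending order: first success
before_last = fails_before(max(successes))   # descending order: last success
print(before_first, before_last)
```

Ordering ascending instead of descending would select the first success, which is what the question asks for.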