T-SQL: Count number of failures before first success (2)

Time: 2017-08-14 12:57:44

Tags: sql tsql analytics vertica

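I have a database containing events for task attempts and their outcomes (fail or success). For each user, I want to count the number of failures before the first success. Subsequent failures and successes should not affect the output: I am only interested in the first success for a given task. In addition, the database contains rows with other events that should be ignored.

How can I formulate this in T-SQL for a Vertica database?

(I ultimately want to compute the average number of attempts per task, but let's leave that out of the scope of this question to keep it manageable.)

This is an update of the question here: T-SQL: Count number of failures until first success

In the original question, I gave poorly constructed sample data that did not fully reflect my use case, and the resulting answers did not work for my actual data, so I couldn't verify them.

The solution should not depend on row order: rows may not have been inserted in timestamp order.

Here is the database setup: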

CREATE TABLE events (
      eventID int -- unused in this example, should be excluded from output (sample IDs are 64-bit: fine for Vertica INT, needs bigint on SQL Server)
    , eventName varchar(256)
    , userName varchar(256)
    , timestamp timestamp
    , taskName varchar(256)
    , sessionID int -- unused in this example, should be excluded from output
);

INSERT INTO events
    VALUES
        (2363460186192576512, 'beginSession', 'John', '2017-08-14 09:46:46.712', NULL, 145031357)
      , (2363460852537008128, 'success', 'John', '2017-08-14 09:49:32.471', 'TaskOne', 145031357)
      , (2363461162974437376, 'success', 'John', '2017-08-14 09:50:48.781', 'TaskOne', 145031357)
      , (2363460390131740672, 'fail', 'John', '2017-08-14 09:47:37.349', 'TaskOne', 145031357)
      , (2363460556662710272, 'fail', 'John', '2017-08-14 09:48:23.024', 'TaskOne', 145031357)
      , (2363460730671505408, 'fail', 'John', '2017-08-14 09:48:58.646', 'TaskOne', 145031357)
      , (2363461032111800320, 'fail', 'John', '2017-08-14 09:50:10.726', 'TaskOne', 145031357)
      , (2363460389896859648, 'beginTask', 'John', '2017-08-14 09:47:05.32', 'TaskOne', 145031357)
      , (2363460463137751040, 'beginTask', 'John', '2017-08-14 09:47:52.166', 'TaskOne', 145031357)
      , (2363460556205531136, 'beginTask', 'John', '2017-08-14 09:48:12.615', 'TaskOne', 145031357)
      , (2363460692671205376, 'beginTask', 'John', '2017-08-14 09:48:36.155', 'TaskOne', 145031357)
      , (2363460852268572672, 'beginTask', 'John', '2017-08-14 09:49:12.047', 'TaskOne', 145031357)
      , (2363460962524327936, 'beginTask', 'John', '2017-08-14 09:49:47.951', 'TaskOne', 145031357)
      , (2363461162714390528, 'beginTask', 'John', '2017-08-14 09:50:23.645', 'TaskOne', 145031357)
      , (2363474741421064192, 'beginSession', 'John', '2017-08-14 10:44:36.042', NULL, 145031392)
      , (2363474885491200000, 'success', 'John', '2017-08-14 10:45:14.577', 'TaskTwo', 145031392)
      , (2363475342389641216, 'success', 'John', '2017-08-14 10:47:04.098', 'TaskTwo', 145031392)
      , (2363475473998635008, 'success', 'John', '2017-08-14 10:47:34.135', 'TaskOne', 145031392)
      , (2363475822079254528, 'success', 'John', '2017-08-14 10:48:53.381', 'TaskTwo', 145031392)
      , (2363476096949104640, 'success', 'John', '2017-08-14 10:50:07.441', 'TaskThree', 145031392)
      , (2363475066098266112, 'fail', 'John', '2017-08-14 10:45:53.526', 'TaskTwo', 145031392)
      , (2363475195152531456, 'fail', 'John', '2017-08-14 10:46:32.81', 'TaskTwo', 145031392)
      , (2363475654638821376, 'fail', 'John', '2017-08-14 10:48:13.71', 'TaskThree', 145031392)
      , (2363476247751114752, 'beginSession', 'Mike', '2017-08-14 10:50:37.67', NULL, 145030476)
      , (2363476335819063296, 'success', 'Mike', '2017-08-14 10:51:06.841', 'TaskOne', 145030476)
      , (2363476485643796480, 'success', 'Mike', '2017-08-14 10:51:41.086', 'TaskTwo', 145030476)
      , (2363476806063038464, 'success', 'Mike', '2017-08-14 10:52:53.174', 'TaskTwo', 145030476)
      , (2363477266119335936, 'success', 'Mike', '2017-08-14 10:54:32.053', 'TaskThree', 145030476)
      , (2363477619191631872, 'success', 'Mike', '2017-08-14 10:56:01.783', 'TaskThree', 145030476)
      , (2363476705131655168, 'fail', 'Mike', '2017-08-14 10:52:21.312', 'TaskThree', 145030476)
      , (2363476939634896896, 'fail', 'Mike', '2017-08-14 10:53:28.906', 'TaskThree', 145030476)
      , (2363477390937976832, 'fail', 'Mike', '2017-08-14 10:55:05.499', 'TaskThree', 145030476)
      , (2363476335592570880, 'beginTask', 'Mike', '2017-08-14 10:50:50.074', 'TaskOne', 145030476)
      , (2363476485501190144, 'beginTask', 'Mike', '2017-08-14 10:51:20.784', 'TaskTwo', 145030476)
      , (2363476704779333632, 'beginTask', 'Mike', '2017-08-14 10:51:54.829', 'TaskThree', 145030476)
      , (2363476805752659968, 'beginTask', 'Mike', '2017-08-14 10:52:34.001', 'TaskTwo', 145030476)
      , (2363476939496484864, 'beginTask', 'Mike', '2017-08-14 10:53:06.468', 'TaskThree', 145030476)
      , (2363477265938980864, 'beginTask', 'Mike', '2017-08-14 10:53:45.631', 'TaskThree', 145030476)
      , (2363477390635986944, 'beginTask', 'Mike', '2017-08-14 10:54:44.706', 'TaskThree', 145030476)
      , (2363477573427560448, 'beginTask', 'Mike', '2017-08-14 10:55:17.231', 'TaskThree', 145030476)
      , (2363474885214375936, 'beginTask', 'John', '2017-08-14 10:44:44.702', 'TaskTwo', 145031392)
      , (2363474985177161728, 'beginTask', 'John', '2017-08-14 10:45:31.133', 'TaskTwo', 145031392)
      , (2363475195014119424, 'beginTask', 'John', '2017-08-14 10:46:10.098', 'TaskTwo', 145031392)
      , (2363475342184120320, 'beginTask', 'John', '2017-08-14 10:46:45.357', 'TaskTwo', 145031392)
      , (2363475473616953344, 'beginTask', 'John', '2017-08-14 10:47:17.911', 'TaskOne', 145031392)
      , (2363475654437494784, 'beginTask', 'John', '2017-08-14 10:47:47.681', 'TaskThree', 145031392)
      , (2363475771776864256, 'beginTask', 'John', '2017-08-14 10:48:27.1', 'TaskTwo', 145031392)
      , (2363476006456762368, 'beginTask', 'John', '2017-08-14 10:49:06.151', 'TaskThree', 145031392)
    ;

Given this data, this is the result I am trying to achieve:

userName  taskName   numFailuresBeforeFirstSuccess
John      TaskOne    3
John      TaskTwo    0
John      TaskThree  1
Mike      TaskOne    0
Mike      TaskTwo    0
Mike      TaskThree  3

3 answers:

Answer 0: (score: 1)

Here is one approach:

select e.username, e.taskname,
       sum(case when timestamp < first_success_ts and e.eventname = 'fail' then 1 else 0 end) as numFailuresBeforeSuccess
from (select e.*,
             min(case when e.eventname = 'success' then e.timestamp end) over
                (partition by e.username, e.taskname) as first_success_ts
      from events e
     ) e
group by e.username, e.taskname
order by e.username, e.taskname;

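This uses a window function to compute the time of the first success per user and task. It should work in both databases (at least in SQL Server 2012+).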
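The same idea can be sketched without a window function, which also keeps rows that have no task (such as beginSession) and user/task pairs that never succeed out of the output. This is a minimal sketch against the question's events table, not a tested Vertica query; the alias names s and first_success_ts are chosen here for illustration. On the sample rows as posted, Mike's TaskThree has two failures before its first success at 10:54:32.053, so this count comes out as 2 for that pair rather than the 3 listed in the desired output.

```sql
-- Failures strictly before the first success, per user and task.
-- The inner join keeps only (userName, taskName) pairs that have a success;
-- rows with a NULL taskName never match the join condition and drop out.
SELECT e.userName,
       e.taskName,
       SUM(CASE WHEN e.eventName = 'fail'
                 AND e.timestamp < s.first_success_ts
                THEN 1 ELSE 0 END) AS numFailuresBeforeFirstSuccess
FROM events e
JOIN (
      -- earliest success per user and task, independent of insert order
      SELECT x.userName, x.taskName, MIN(x.timestamp) AS first_success_ts
      FROM events x
      WHERE x.eventName = 'success'
      GROUP BY x.userName, x.taskName
     ) s
  ON s.userName = e.userName
 AND s.taskName = e.taskName
GROUP BY e.userName, e.taskName
ORDER BY e.userName, e.taskName;
```

Because the first success is found with MIN over the timestamp column, the result does not depend on the order in which rows were inserted.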

Answer 1: (score: 1)

Again, this is T-SQL rather than Vertica, but as long as Vertica supports CTEs, it is fairly standard SQL.

; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts
        , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS 
        , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
    FROM #taskevents t1
    WHERE t1.eventName = 'success' -- lowercase, as stored in the data; matters under case-sensitive comparison
)
SELECT s1.userName, s1.taskName, AVG(s1.failCount) AS avgFailCount
FROM (
    SELECT cte1.userName, cte1.taskName , cte1.rn,  COALESCE(COUNT(t2.ts),0) AS failCount 
    FROM cte1
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName
        AND t2.taskName = cte1.taskName
        AND t2.ts < cte1.ts  
        AND ( t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL )
        AND  t2.eventName = 'fail'
    GROUP BY cte1.userName, cte1.taskName, cte1.rn
) s1
GROUP BY s1.userName, s1.taskName
ORDER BY s1.userName, s1.taskName

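This gives your averages. Remove the outer query to get the data I'm working from. It produces slightly different results from the ones you listed, but it should give the correct averages as you described them. Let me know if I've misunderstood the requirement.

Note: in my test data I also added two people who have fails but no success, just to verify that they are excluded from the results.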

, (2363476006456762398, 'fail', 'Steve', '2017-08-14 11:29:06.151', 'Task42', 145031342)
, (2363476046456762368, 'fail', 'Joe', '2017-08-14 11:49:06.151', 'Task42', 145031399)

=====================================

Results

-----------------------------------
|userName|  taskName |avgFailCount|
-----------------------------------
|  John  | TaskOne   |     1      |
|  John  | TaskThree |     1      |
|  John  | TaskTwo   |     0      |
|  Mike  | TaskOne   |     0      |
|  Mike  | TaskThree |     1      |
|  Mike  | TaskTwo   |     0      |
-----------------------------------

=====================================

EDIT: To get the average by task only:

; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts
        , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS 
        , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn
    FROM #taskevents t1
    WHERE t1.eventName = 'success' -- lowercase, as stored in the data; matters under case-sensitive comparison
)
SELECT s1.taskName
    , AVG(CAST(s1.failCount AS decimal(5,2))) AS avgFailCount
FROM (
    SELECT cte1.userName, cte1.taskName , cte1.rn,  COALESCE(COUNT(t2.ts),0) AS failCount 
    FROM cte1
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName
        AND t2.taskName = cte1.taskName
        AND t2.ts < cte1.ts  
        AND ( t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL )
        AND  t2.eventName = 'fail'
    GROUP BY cte1.userName, cte1.taskName, cte1.rn
) s1
GROUP BY s1.taskName
ORDER BY s1.taskName

Which gives you

--------------------------
| taskName  |avgFailCount|
--------------------------
| TaskOne   | 1.000000   |
| TaskThree | 1.333333   |
| TaskTwo   | 0.400000   |
--------------------------

Which is essentially

SELECT (3+1+0+0)/4.0  AS TaskOne
SELECT (0+2+0+0+0)/5.0 AS TaskTwo
SELECT (1+2+1)/3.0 AS TaskThree

from the following data points.

--------------------------------
|userName|  taskName |FailCount|
--------------------------------
|  John  | TaskOne   |    3    |
|  John  | TaskOne   |    1    |
|  John  | TaskOne   |    0    |
|  Mike  | TaskOne   |    0    |
|  John  | TaskTwo   |    0    |
|  John  | TaskTwo   |    2    |
|  John  | TaskTwo   |    0    |
|  Mike  | TaskTwo   |    0    |
|  Mike  | TaskTwo   |    0    |
|  John  | TaskThree |    1    |
|  Mike  | TaskThree |    2    |
|  Mike  | TaskThree |    1    |
--------------------------------

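This is the average number of failures before a success, not the average number of failures per attempt. That would be slightly different.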

---------------------------------------------------
| task | fails | attempts | avg fails per attempt |
---------------------------------------------------
| Task1|   4   |    8     | 4/8 = 0.500000        |
| Task2|   2   |    7     | 2/7 = 0.285714        |
| Task3|   4   |    7     | 4/7 = 0.571429        |
---------------------------------------------------
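A per-attempt ratio like the one in the table above could be computed directly from the question's events table. The following is a minimal sketch that assumes an attempt is any row whose eventName is 'fail' or 'success'; that definition is an assumption made here, not something the question states, and the column aliases are illustrative.

```sql
-- Failures per attempt, per task.
-- Assumption: every 'fail' or 'success' row counts as one attempt.
SELECT e.taskName,
       SUM(CASE WHEN e.eventName = 'fail' THEN 1 ELSE 0 END) AS fails,
       COUNT(*) AS attempts,
       -- 1.0 / 0.0 force decimal rather than integer division
       SUM(CASE WHEN e.eventName = 'fail' THEN 1.0 ELSE 0.0 END) / COUNT(*) AS avgFailsPerAttempt
FROM events e
WHERE e.eventName IN ('fail', 'success')
GROUP BY e.taskName
ORDER BY e.taskName;
```

Unlike the query above it, this one ignores the order of events entirely, so failures that happen after a task's last success are counted as well.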

Answer 2: (score: 0)

This query:

 with F as
 (
    select *  from Evts where eventName = 'fail'
 ),

 S as
 (
    select * from Evts E 
        cross apply
        (
            select count(F.eventID) numFailuresBeforeFirstSuccess from F 
                where F.userName = E.userName and 
                      E.taskName = F.taskName and 
                      F.timestamp < E.timestamp
        ) K

    where eventName = 'success'
 )

 select userName, taskName, numFailuresBeforeFirstSuccess from      
    (select *, row_number() over (partition by userName, taskName order by [timestamp] desc) o from S ) S 
       where o = 1

produces this result:

userName    taskName    numFailuresBeforeFirstSuccess
----------- ----------- -----------------------------
John        TaskOne     4
John        TaskThree   1
John        TaskTwo     2
Mike        TaskOne     0
Mike        TaskThree   3
Mike        TaskTwo     0
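
Note that the query ranks successes by timestamp descending, so the counts above are taken relative to each pair's last success (hence 4 for John/TaskOne and 2 for John/TaskTwo). A variant that ranks ascending, and uses a LEFT JOIN in place of CROSS APPLY (which is T-SQL specific), might look like the sketch below. It targets the question's events table rather than Evts; the names firstSuccess, ranked and ts are illustrative, and it has not been run on Vertica.

```sql
-- Pick each pair's earliest success, then count the failures before it.
WITH firstSuccess AS (
    SELECT ranked.userName, ranked.taskName, ranked.timestamp AS ts
    FROM (
          SELECT e.userName, e.taskName, e.timestamp,
                 ROW_NUMBER() OVER (PARTITION BY e.userName, e.taskName
                                    ORDER BY e.timestamp ASC) AS o
          FROM events e
          WHERE e.eventName = 'success'
         ) ranked
    WHERE ranked.o = 1
)
SELECT s.userName,
       s.taskName,
       COUNT(f.eventID) AS numFailuresBeforeFirstSuccess -- counts matched fail rows only, so 0 when none match
FROM firstSuccess s
LEFT JOIN events f
       ON f.eventName = 'fail'
      AND f.userName  = s.userName
      AND f.taskName  = s.taskName
      AND f.timestamp < s.ts
GROUP BY s.userName, s.taskName
ORDER BY s.userName, s.taskName;
```

The LEFT JOIN keeps pairs whose first attempt succeeded, so they appear with a count of 0 instead of disappearing from the result.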

The previous explanation applies here.

Rextester Demo