Question

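I am working with a large data set from a tester database (150k per day). Each row contains data about a specific test of a product. Each tester inserts the results of its tests.

I want to take some measurements, such as the fail rate per shift for each product and tester. The problem is that no batch numbers are assigned, so I can't easily select by batch.

Consider the following subselect of the whole table: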

 id   tBegin                orderId   
------------------------------------
 1    2018-10-20 00:00:05   1
 2    2018-10-20 00:05:15   1
 3    2018-10-20 01:00:05   1
 10   2018-10-20 10:03:05   3
 12   2018-10-20 11:04:05   8
 20   2018-10-20 14:15:05   3
 37   2018-10-20 18:12:05   1

My goal is to cluster the data into the following:

 id   tBegin                orderId   pCount 
--------------------------------------------
 1    2018-10-20 00:00:05   1         3
 10   2018-10-20 10:03:05   3         1
 12   2018-10-20 11:04:05   8         1
 20   2018-10-20 14:15:05   3         1
 37   2018-10-20 18:12:05   1         1

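A simple GROUP BY orderId won't do the trick, because the same order can appear in several separate runs, so I came up with the following: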

SELECT 
  MIN(c.id) AS id,
  MIN(c.tBegin) AS tBegin,
  c.orderId,
  COUNT(*) AS pCount
FROM (
    SELECT t2.id, t2.tBegin, t2.orderId,
      ( SELECT TOP 1 t.id
        FROM history t
        WHERE t.tBegin > t2.tBegin
          AND t.orderID <> t2.orderID
          AND <restrict date here further>
        ORDER BY t.tBegin 
       ) AS nextId
    FROM history t2 
) AS c
WHERE <restrict date here>
GROUP BY c.orderID, c.nextId

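I left out the WHERE clauses that select the correct date and tester. This works, but it seems very inefficient. I have worked with small databases before, but I am new to SQL Server 2017.

I'd greatly appreciate your help!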

Answer 1

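You can use window functions for this, in three steps:

The first CTE assigns a "change flag" to every row where the value changes.
The second CTE uses a running sum to turn the 1s and 0s into a number that can be used to group the rows.
Finally, you number the rows in each group and select the first row of each group.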
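
A minimal sketch of those three steps, not necessarily the exact query this answer had in mind. The CTE and column names (flagged, grouped, numbered, changeFlag, grp, rn) are made up for illustration, and the sample rows from the question are inlined with VALUES so the query runs without the real history table. It assumes rows should be ordered by tBegin, with id as a tie-breaker.

```sql
WITH history AS (
    -- Sample rows copied from the question; drop this CTE to query the real table.
    SELECT id, CAST(tBegin AS datetime2(0)) AS tBegin, orderId
    FROM (VALUES
        (1,  '2018-10-20 00:00:05', 1),
        (2,  '2018-10-20 00:05:15', 1),
        (3,  '2018-10-20 01:00:05', 1),
        (10, '2018-10-20 10:03:05', 3),
        (12, '2018-10-20 11:04:05', 8),
        (20, '2018-10-20 14:15:05', 3),
        (37, '2018-10-20 18:12:05', 1)
    ) AS v(id, tBegin, orderId)
),
flagged AS (
    -- Step 1: flag = 1 when orderId differs from the previous row (or there is no previous row).
    SELECT id, tBegin, orderId,
           CASE WHEN LAG(orderId) OVER (ORDER BY tBegin, id) = orderId
                THEN 0 ELSE 1 END AS changeFlag
    FROM flagged_source
),
grouped AS (
    -- Step 2: the running sum of the flags gives every consecutive run its own number.
    SELECT id, tBegin, orderId,
           SUM(changeFlag) OVER (ORDER BY tBegin, id
                                 ROWS UNBOUNDED PRECEDING) AS grp
    FROM flagged
),
numbered AS (
    -- Step 3: number the rows inside each run and count them.
    SELECT id, tBegin, orderId,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY tBegin, id) AS rn,
           COUNT(*)     OVER (PARTITION BY grp)                     AS pCount
    FROM grouped
)
SELECT id, tBegin, orderId, pCount
FROM numbered
WHERE rn = 1          -- keep only the first row of each run
ORDER BY tBegin, id;
```

In the sketch, flagged_source stands for whichever row source you use: replace it with history (the inlined sample above, or the real table plus your date and tester filters). On the sample rows the flags are 1, 0, 0, 1, 1, 1, 1, so the running sum yields runs 1 to 5 and the result has five rows, with pCount = 3 for the first run and 1 for the others.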

Demo on DB Fiddle

Answer 2

You can use a cumulative approach based on the difference of two row numbers:

select min(id) as id, min(tBegin) as tBegin, orderid, count(*) as pCount
from (select h.*,
             row_number() over (order by id) as seq1,
             row_number() over (partition by orderid order by id) as seq2
      from history h
     ) h
group by orderid, (seq1 - seq2)
order by id;
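
To see why grouping by (seq1 - seq2) works, it can help to look at the intermediate columns. The sketch below inlines the sample rows from the question (an assumption made only so it runs standalone) and adds a diff column, a name chosen here for illustration.

```sql
WITH history AS (
    -- Sample rows copied from the question.
    SELECT id, orderId
    FROM (VALUES (1, 1), (2, 1), (3, 1), (10, 3), (12, 8), (20, 3), (37, 1)
         ) AS v(id, orderId)
)
SELECT id, orderId, seq1, seq2, seq1 - seq2 AS diff
FROM (SELECT h.*,
             ROW_NUMBER() OVER (ORDER BY id)                      AS seq1,
             ROW_NUMBER() OVER (PARTITION BY orderId ORDER BY id) AS seq2
      FROM history h
     ) h
ORDER BY id;

-- id  orderId  seq1  seq2  diff
--  1     1      1     1     0
--  2     1      2     2     0
--  3     1      3     3     0
-- 10     3      4     1     3
-- 12     8      5     1     4
-- 20     3      6     2     4
-- 37     1      7     4     3
```

Within a run of consecutive rows for the same order, both counters increase by one per row, so diff stays constant; as soon as another order interrupts the run, seq1 advances while seq2 does not, and diff changes. Note that diff alone is not unique (3 occurs for both order 3 and order 1), which is why the query groups by orderid and the difference together.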

Efficient way to cluster a timeline or reconstruct batch numbers
