SQL: split data into runs of consecutive sequence numbers, then subset by pattern

Time: 2017-10-05 11:22:02

Tags: sql sql-server subset sequences

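I have data in which I am trying to identify patterns. However, the data in each table is incomplete (rows are missing). I would like to split the table into blocks of complete data and then identify the pattern within each block. I have a column, Sequence, that I can use to tell whether the data is complete.

The data looks like: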

Sequence      Position 
1              open
2              closed 
3              open
4              open
5              closed
8              closed
9              open
11             open
13             closed
14             open 
15             open
18             closed
19             open
20             closed

First, I would like to split the data into complete sections:

   Sequence      Position 
    1              open
    2              closed 
    3              open
    4              open
    5              closed
---------------------------
    8              closed
    9              open
---------------------------
    11             open
---------------------------
    13             closed
    14             open 
    15             open
---------------------------
    18             closed
    19             open
    20             closed

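Then I would like to identify the pattern closed, open, ..., open, closed: we go from closed to open for n rows (where n is at least 1) and then back to closed.

From the sample data, this would give: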

     Sequence        Position 
        2              closed 
        3              open
        4              open
        5              closed
    ---------------------------
        18             closed
        19             open
        20             closed

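This leaves me with a final table that I can analyse, knowing that it contains no broken sequences. I also have another column in which Position is encoded as binary, if that is easier to work with.

The tables are large, so although I think I could write loops to work out my result, I doubt that approach would be efficient enough. Alternatively, I was going to pull the whole table into R and find the result table there, but that requires loading everything into R first, so I am wondering whether this is feasible in SQL.

Edit: different sample data that is more representative: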

Sequence      Position 
    1              open
    2              closed 
    3              open
    4              open
    5              closed
    8              closed
    9              open
    11             open
    13             closed
    14             open 
    15             open
    18             closed
    19             open
    20             closed
    21             closed
    22             closed
    23             closed
    24             open
    25             open
    26             closed
    27             open

Note that this should give the same results as before, but also with

    23             closed
    24             open
    25             open
    26             closed

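21, 22 and 27 do not fit the closed, open, ..., open, closed pattern.

But if we had 28 closed, we would want 27 and 28, since there is no gap in time and the pattern fits. If instead of 28 it were 29 closed, we would not want 27 or 29 (because although the pattern is right, the sequence is broken).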

  

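To add some context, think of a machine that goes from stopped, to running, to stopped. We record data, but there are gaps in the recording, which show up as breaks in the sequence. Besides data going missing in the middle of a stopped-running-stopped cycle, the recording sometimes starts when the machine is already running, or stops before the machine stops. I do not want that data, as it is not a complete stopped, running, stopped cycle. I only want the complete cycles in which the sequence is continuous. This means I can transform my original data set into one made up only of complete cycles, one after the other.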

2 answers:

Answer 0: (score: 1)

You can use this.

DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))

INSERT INTO @MyTable
VALUES
(1,'open'),
(2,'closed') ,
(3,'open'),
(4,'open'),
(5,'closed'),
(8,'closed'),
(9,'open'),
(11,'open'),
(13,'closed'),
(14,'open') ,
(15,'open'),
(18,'closed'),
(19,'open'),
(20,'closed'),
(21,'closed'),
(22,'closed'),
(23,'closed'),
(24,'open'),
(25,'open'),
(26,'closed'),
(27,'open')


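;WITH CTE AS(
    -- Flag a 'closed' row that immediately follows another 'closed' row.
    SELECT * ,
        CASE WHEN Position ='closed' AND LAG(Position) OVER(ORDER BY [Sequence]) ='closed' THEN 1 ELSE 0 END CloseMark
    FROM @MyTable
)
,CTE_2 AS 
(
    -- Shift each flagged row (and every row after it) up by one, so that
    -- two adjacent 'closed' rows end up in different blocks.
    SELECT 
        [New_Sequence] = [Sequence] + (SUM(CloseMark) OVER(ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING )) 
        , [Sequence]
        , Position
     FROM CTE
)
,CTE_3 AS (
    SELECT *, 
    RN = ROW_NUMBER() OVER(ORDER BY [New_Sequence]) 
    FROM CTE_2
)
,CTE_4 AS
(
    -- New_Sequence - RN is constant within a run of consecutive values,
    -- so it identifies each unbroken block; keep its first and last 'closed' row.
    SELECT ([New_Sequence] - RN) G
    , MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) MinCloseSq
    , MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) MaxCloseSq
    FROM CTE_3 
    GROUP BY ([New_Sequence] - RN)
)
-- Return the rows between the first and last 'closed' row of every block
-- that contains at least two 'closed' rows.
SELECT
    CTE.Sequence, CTE.Position
FROM CTE_4 
    INNER JOIN CTE  ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE
    CTE_4.MaxCloseSq > CTE_4.MinCloseSq
    AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)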

Result:

Sequence    Position
----------- ----------
2           closed
3           open
4           open
5           closed
---         ---
18          closed
19          open
20          closed
---         ---
23          closed
24          open
25          open
26          closed
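
To see why the query groups rows the way it does, it can help to look at the intermediate columns. The following is only a sketch: it repeats the first steps of the query on a shortened copy of the sample data, and the names @Sample, Marked, Shifted and GroupKey are made up for this illustration.

```sql
-- Illustration only: @Sample is a hypothetical, shortened copy of the sample data.
DECLARE @Sample TABLE (Sequence INT, Position VARCHAR(10));

INSERT INTO @Sample VALUES
(18,'closed'),(19,'open'),(20,'closed'),(21,'closed'),(22,'closed'),
(23,'closed'),(24,'open'),(25,'open'),(26,'closed'),(27,'open');

WITH Marked AS (
    -- Flag a 'closed' row that directly follows another 'closed' row.
    SELECT Sequence, Position,
           CASE WHEN Position = 'closed'
                 AND LAG(Position) OVER (ORDER BY Sequence) = 'closed'
                THEN 1 ELSE 0 END AS CloseMark
    FROM @Sample
),
Shifted AS (
    -- Each flagged row pushes itself and every later row up by one,
    -- which opens an artificial gap between two adjacent 'closed' rows.
    SELECT Sequence, Position, CloseMark,
           Sequence + SUM(CloseMark) OVER (ORDER BY Sequence ROWS UNBOUNDED PRECEDING) AS New_Sequence
    FROM Marked
)
SELECT Sequence, Position, CloseMark, New_Sequence,
       -- Rows that share the same key form one unbroken block.
       New_Sequence - ROW_NUMBER() OVER (ORDER BY New_Sequence) AS GroupKey
FROM Shifted
ORDER BY Sequence;
```

On these rows the key comes out as 17 for sequences 18 to 20, 18 for 21, 19 for 22, and 20 for 23 to 27. The run of adjacent closed rows is therefore cut apart, and only the blocks 18 to 20 and 23 to 27 contain two distinct closed rows.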

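Answer 1: (score: 0)

I think there is actually a relatively simple way to look at this. You can identify the sequence number of each closing row by:

  • Looking at the sequence number of the previous closed row
  • Looking at the cumulative number of opens at the previous closed row and at the current one
  • Doing arithmetic to ensure that all the intermediate rows are in the data

This turns into a query: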

select s.*
from (select t.*,
             -- previous row with the same position, in sequence order
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) s
-- window-function results can only be filtered one level up
where position = 'closed' and
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
      sequence > prev_sequence + 1;
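
To make the arithmetic concrete, here is a small sketch that prints the two numbers being compared for each pair of neighbouring closed rows. The table variable @t stands in for t, and the CTE names and the aliases opens_between and rows_expected_between are made up for this illustration.

```sql
-- Illustration only: @t is a hypothetical stand-in for the table t.
DECLARE @t TABLE (sequence INT, position VARCHAR(10));

INSERT INTO @t VALUES
(13,'closed'),(14,'open'),(15,'open'),(18,'closed'),
(19,'open'),(20,'closed'),(21,'closed');

WITH counted AS (
    -- Running count of 'open' rows up to and including the current row.
    SELECT sequence, position,
           SUM(CASE WHEN position = 'open' THEN 1 ELSE 0 END) OVER (ORDER BY sequence) AS cume_opens
    FROM @t
),
paired AS (
    -- For each 'closed' row, look back to the previous 'closed' row.
    SELECT sequence, cume_opens,
           LAG(sequence)   OVER (ORDER BY sequence) AS prev_sequence,
           LAG(cume_opens) OVER (ORDER BY sequence) AS prev_cume_opens
    FROM counted
    WHERE position = 'closed'
)
SELECT prev_sequence, sequence,
       cume_opens - prev_cume_opens AS opens_between,
       sequence - prev_sequence - 1 AS rows_expected_between
FROM paired
WHERE prev_sequence IS NOT NULL
ORDER BY sequence;
```

For the pair (13, 18) there are 2 opens but 4 rows are expected in between, so rows are missing. For (18, 20) both numbers are 1, which is a complete cycle. For (20, 21) both numbers are 0: nothing is missing, but there is no open row in between, so it is not a cycle either.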

Now that you have identified the sequences, you can join back to get the original rows:

select t.*
from t join
     (select s.*
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) s
      where position = 'closed' and
            (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
            sequence > prev_sequence + 1
     ) seqs
     on t.sequence between seqs.prev_sequence and seqs.sequence;

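I admit that I have not tested this. I think the idea works, though. One thing to note is that it will select multiple "closed" periods for each sequence group.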
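
When two cycles sit back to back and share a closed row (closed, open, closed, open, closed), a join on BETWEEN returns that shared row once per cycle. One possible way around this, sketched here under the assumption that each original row should appear at most once, is to filter with EXISTS instead of joining. The table variable @t and the CTE names are made up for this illustration.

```sql
-- Illustration only: @t is a hypothetical stand-in for the table t.
DECLARE @t TABLE (sequence INT, position VARCHAR(10));

INSERT INTO @t VALUES
(1,'closed'),(2,'open'),(3,'closed'),(4,'open'),(5,'closed'),(7,'open');

WITH counted AS (
    -- Running count of 'open' rows up to and including the current row.
    SELECT sequence, position,
           SUM(CASE WHEN position = 'open' THEN 1 ELSE 0 END) OVER (ORDER BY sequence) AS cume_opens
    FROM @t
),
paired AS (
    -- Pair each 'closed' row with the previous 'closed' row.
    SELECT sequence, cume_opens,
           LAG(sequence)   OVER (ORDER BY sequence) AS prev_sequence,
           LAG(cume_opens) OVER (ORDER BY sequence) AS prev_cume_opens
    FROM counted
    WHERE position = 'closed'
),
cycles AS (
    -- Keep pairs with no missing rows and at least one 'open' row in between.
    SELECT prev_sequence AS first_sequence, sequence AS last_sequence
    FROM paired
    WHERE cume_opens - prev_cume_opens = sequence - prev_sequence - 1
      AND sequence > prev_sequence + 1
)
SELECT t.sequence, t.position
FROM @t AS t
WHERE EXISTS (SELECT 1
              FROM cycles AS c
              WHERE t.sequence BETWEEN c.first_sequence AND c.last_sequence)
ORDER BY t.sequence;
```

Here the cycles are 1 to 3 and 3 to 5. Row 3 belongs to both but is returned only once, and row 7 is left out because it follows a gap.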