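I have data in which I am trying to identify patterns. However, the data in each table is incomplete (rows are missing). I would like to split the table into blocks of complete data and then identify the pattern within each block. I have a column, sequence, that I can use to tell whether the data is complete.
The data looks like: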
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
First, I would like to split the data into complete sections:
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
---------------------------
8 closed
9 open
---------------------------
11 open
---------------------------
13 closed
14 open
15 open
---------------------------
18 closed
19 open
20 closed
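Then I would like to identify the pattern closed, open, ..., open, closed, so that we go from closed to open for n rows (where n is at least 1) and then back to closed.
From the sample data, this would give: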
Sequence Position
2 closed
3 open
4 open
5 closed
---------------------------
18 closed
19 open
20 closed
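This leaves me with a final table that I can analyze, since I know it contains no broken sequences. I also have another column in which position is binary, if that is easier to work with.
The table is large, so although I think I could write loops to compute my result, I do not think that approach would be efficient enough. Alternatively, I was going to pull the whole table into R and build the result table there, but that requires loading everything into R first, so I am wondering whether this is feasible in SQL.
Edit: different sample data that is more representative: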
Sequence Position
1 open
2 closed
3 open
4 open
5 closed
8 closed
9 open
11 open
13 closed
14 open
15 open
18 closed
19 open
20 closed
21 closed
22 closed
23 closed
24 open
25 open
26 closed
27 open
Note that this should give the same results as before, but also include:
23 closed
24 open
25 open
26 closed
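21, 22 and 27 do not fit the closed, open, ..., open, closed pattern.
But if we also had 28 closed, we would want 27 and 28, because there is no gap in time and the pattern fits. If instead of 28 it were 29 closed, we would not want 27 or 29 (because although the pattern is right, the sequence is broken).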
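To add some context, think of a machine that goes from stopped, to running, to stopped. We log data, but there are gaps in the logs, and these gaps show up as breaks in the sequence. As well as data being lost in the middle of a stop, run, stop cycle, recording sometimes starts when the machine is already running, or stops before the machine has stopped. I do not want that data, because it is not a complete stop, run, stop cycle. I only want the complete cycles in which the sequence is continuous. This means I can turn the original dataset into one made up of complete cycles, one after another.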
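The rules above can be pinned down with a small reference implementation, which may be useful for checking any SQL solution on a sample. This is only a sketch in Python, assuming the rows arrive as (sequence, position) tuples sorted by sequence and that position is always either open or closed; the function names split_into_runs and complete_cycles are made up for illustration.

```python
rows = [
    (1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
    (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
    (15, "open"), (18, "closed"), (19, "open"), (20, "closed"), (21, "closed"),
    (22, "closed"), (23, "closed"), (24, "open"), (25, "open"), (26, "closed"),
    (27, "open"),
]


def split_into_runs(rows):
    """Split rows (sorted by sequence) into runs with no missing sequence numbers."""
    runs = []
    for seq, pos in rows:
        if runs and seq == runs[-1][-1][0] + 1:
            runs[-1].append((seq, pos))   # continues the current run
        else:
            runs.append([(seq, pos)])     # a gap (or the first row) starts a new run
    return runs


def complete_cycles(rows):
    """Return every closed, open, ..., open, closed cycle inside a gap-free run."""
    cycles = []
    for run in split_into_runs(rows):
        start = None  # index of the most recent 'closed' row in this run
        for i, (_, pos) in enumerate(run):
            if pos != "closed":
                continue
            # At least one row (necessarily 'open') must sit between two 'closed' rows.
            if start is not None and i - start > 1:
                cycles.append(run[start:i + 1])
            start = i
    return cycles


for cycle in complete_cycles(rows):
    print([seq for seq, _ in cycle])
```

On the second sample this prints the sequence numbers 2 to 5, 18 to 20 and 23 to 26. Two back-to-back cycles share their middle closed row, so that row appears in both returned cycles.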
Answer 0 (score: 1)
You can use this.
DECLARE @MyTable TABLE (Sequence INT, Position VARCHAR(10))
INSERT INTO @MyTable
VALUES
    (1, 'open'),
    (2, 'closed'),
    (3, 'open'),
    (4, 'open'),
    (5, 'closed'),
    (8, 'closed'),
    (9, 'open'),
    (11, 'open'),
    (13, 'closed'),
    (14, 'open'),
    (15, 'open'),
    (18, 'closed'),
    (19, 'open'),
    (20, 'closed'),
    (21, 'closed'),
    (22, 'closed'),
    (23, 'closed'),
    (24, 'open'),
    (25, 'open'),
    (26, 'closed'),
    (27, 'open')
;WITH CTE AS (
    -- Flag every 'closed' row that directly follows another 'closed' row.
    SELECT *,
           CASE WHEN Position = 'closed'
                 AND LAG(Position) OVER (ORDER BY [Sequence]) = 'closed'
                THEN 1 ELSE 0 END AS CloseMark
    FROM @MyTable
)
, CTE_2 AS (
    -- Shift the sequence by the running count of flags, so that
    -- consecutive 'closed' rows end up in different groups.
    SELECT [New_Sequence] = [Sequence] + (SUM(CloseMark) OVER (ORDER BY [Sequence] ROWS UNBOUNDED PRECEDING)),
           [Sequence],
           Position
    FROM CTE
)
, CTE_3 AS (
    SELECT *,
           RN = ROW_NUMBER() OVER (ORDER BY [New_Sequence])
    FROM CTE_2
)
, CTE_4 AS (
    -- New_Sequence - RN stays constant within each gap-free group.
    SELECT ([New_Sequence] - RN) AS G,
           MIN(CASE WHEN Position = 'closed' THEN [Sequence] END) AS MinCloseSq,
           MAX(CASE WHEN Position = 'closed' THEN [Sequence] END) AS MaxCloseSq
    FROM CTE_3
    GROUP BY ([New_Sequence] - RN)
)
-- Keep the rows between the first and last 'closed' row of each group,
-- provided the group contains two different 'closed' rows.
SELECT CTE.Sequence, CTE.Position
FROM CTE_4
INNER JOIN CTE ON (CTE.Sequence BETWEEN CTE_4.MinCloseSq AND CTE_4.MaxCloseSq)
WHERE CTE_4.MaxCloseSq > CTE_4.MinCloseSq
  AND (CTE_4.MaxCloseSq IS NOT NULL AND CTE_4.MinCloseSq IS NOT NULL)
Result:
Sequence Position
----------- ----------
2 closed
3 open
4 open
5 closed
--- ---
18 closed
19 open
20 closed
--- ---
23 closed
24 open
25 open
26 closed
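Answer 1 (score: 0)
I think there is actually a relatively simple way to look at this. You can identify the sequence numbers of the closing rows by comparing, for each closed row, its running count of open rows with that of the previous closed row.
This turns into a query: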
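select c.*
from (select t.*,
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
            from t
           ) t
     ) c
-- the filter sits outside the subquery because window results cannot be used in its own where clause
where position = 'closed' and
      -- every sequence value between the two closed rows is an open row (no gap)
      (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
      -- at least one open row lies between them
      sequence > prev_sequence + 1;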
Now that you have identified the sequences, you can join back to get the original rows:
select t.*
from t join
     (select c.*
      from (select t.*,
                   lag(sequence) over (partition by position order by sequence) as prev_sequence,
                   lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
            from (select t.*,
                         sum(case when position = 'open' then 1 else 0 end) over (order by sequence) as cume_opens
                  from t
                 ) t
           ) c
      where position = 'closed' and
            (cume_opens - prev_cume_opens) = sequence - prev_sequence - 1 and
            sequence > prev_sequence + 1
     ) seqs
     on t.sequence between seqs.prev_sequence and seqs.sequence;
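I admit that I have not tested this. Still, I think the idea works. One thing to note is that it will pick out multiple "closed" periods within each sequence group.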
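One way to try the idea on the sample data is to run it in an in-memory SQLite database from Python. The sketch below makes several assumptions: it needs an SQLite build with window-function support (3.25 or later), it compares against 'closed' as the sample data spells it, it applies the filter one level outside the window functions, and it requires at least one open row between the two closed rows. The aliases o and c are just illustrative.

```python
import sqlite3

rows = [
    (1, "open"), (2, "closed"), (3, "open"), (4, "open"), (5, "closed"),
    (8, "closed"), (9, "open"), (11, "open"), (13, "closed"), (14, "open"),
    (15, "open"), (18, "closed"), (19, "open"), (20, "closed"), (21, "closed"),
    (22, "closed"), (23, "closed"), (24, "open"), (25, "open"), (26, "closed"),
    (27, "open"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (sequence integer, position text)")
conn.executemany("insert into t values (?, ?)", rows)

query = """
select c.prev_sequence, c.sequence
from (select o.*,
             lag(sequence) over (partition by position order by sequence) as prev_sequence,
             lag(cume_opens) over (partition by position order by sequence) as prev_cume_opens
      from (select t.*,
                   sum(case when position = 'open' then 1 else 0 end)
                       over (order by sequence) as cume_opens
            from t
           ) as o
     ) as c
where c.position = 'closed'
  and c.cume_opens - c.prev_cume_opens = c.sequence - c.prev_sequence - 1
  and c.sequence > c.prev_sequence + 1
order by c.sequence
"""

# Each result row is (first closed sequence, last closed sequence) of one cycle.
cycles = conn.execute(query).fetchall()
print(cycles)
conn.close()
```

On the sample this yields the pairs (2, 5), (18, 20) and (23, 26), which match the cycles the question asks for. Joining these ranges back to t returns the rows in between.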