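I need to reduce my data to one row per group of equal (x) values, aggregating (y) within each group, for example as mean(y). The problem: a group of (x) values must be consecutive over (DATE).
Here is a sample of the data: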
| DATE | x | y |
---------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 | 3 |
| 2017-03-12 13:52:55.607 | 77.01 | 5 |
| 2017-03-12 13:53:54.920 | 78.89 | 7 |
| 2017-03-12 13:54:12.320 | 78.89 | 8 |
| 2017-03-12 13:54:50.287 | 78.89 | 6 |
| 2017-03-12 13:56:07.130 | 89.31 | 5 |
| 2017-03-12 13:56:44.997 | 89.31 | 4 |
| 2017-03-12 13:59:55.200 | 16.13 | 9 |
| 2017-03-12 13:59:55.400 | 16.13 | 10 |
| 2017-03-12 14:00:33.240 | 16.13 | 13 |
| 2017-03-12 14:03:04.450 | 19.01 | 8 |
| 2017-03-12 14:04:59.250 | 77.01 | 12 |
| 2017-03-12 14:05:37.707 | 77.01 | 15 |
| 2017-03-12 14:07:30.517 | 77.01 | 14 |
| 2017-03-12 14:08:29.757 | 78.89 | 8 |
What I have so far (note the problem with the value 77.01 in x):
| DATE | x | y | Grp |
----------------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 | 3 | 1 |
| 2017-03-12 13:52:55.607 | 77.01 | 5 | 1 |
| 2017-03-12 13:53:54.920 | 78.89 | 7 | 2 |
| 2017-03-12 13:54:12.320 | 78.89 | 8 | 2 |
| 2017-03-12 13:54:50.287 | 78.89 | 6 | 2 |
| 2017-03-12 13:56:07.130 | 89.31 | 5 | 3 |
| 2017-03-12 13:56:44.997 | 89.31 | 4 | 3 |
| 2017-03-12 13:59:55.200 | 16.13 | 9 | 4 |
| 2017-03-12 13:59:55.400 | 16.13 | 10 | 4 |
| 2017-03-12 14:00:33.240 | 16.13 | 13 | 4 |
| 2017-03-12 14:03:04.450 | 19.01 | 8 | 5 |
| 2017-03-12 14:04:59.250 | 77.01 | 12 | 1 |-
| 2017-03-12 14:05:37.707 | 77.01 | 15 | 1 |-- must be group 6 not 1
| 2017-03-12 14:07:30.517 | 77.01 | 14 | 1 |-
| 2017-03-12 14:08:29.757 | 78.89 | 8 | 6 |
What I want:
| DATE | x | y | Grp |
----------------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 | 3 | 1 |
| 2017-03-12 13:52:55.607 | 77.01 | 5 | 1 |
| 2017-03-12 13:53:54.920 | 78.89 | 7 | 2 |
| 2017-03-12 13:54:12.320 | 78.89 | 8 | 2 |
| 2017-03-12 13:54:50.287 | 78.89 | 6 | 2 |
| 2017-03-12 13:56:07.130 | 89.31 | 5 | 3 |
| 2017-03-12 13:56:44.997 | 89.31 | 4 | 3 |
| 2017-03-12 13:59:55.200 | 16.13 | 9 | 4 |
| 2017-03-12 13:59:55.400 | 16.13 | 10 | 4 |
| 2017-03-12 14:00:33.240 | 16.13 | 13 | 4 |
| 2017-03-12 14:03:04.450 | 19.01 | 8 | 5 |
| 2017-03-12 14:04:59.250 | 77.01 | 12 | 6 |
| 2017-03-12 14:05:37.707 | 77.01 | 15 | 6 |
| 2017-03-12 14:07:30.517 | 77.01 | 14 | 6 |
| 2017-03-12 14:08:29.757 | 78.89 | 8 | 7 |
So that I can then get mean(y) by Grp:
| DATE | x |Mean(y)| Grp |
----------------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 | 4 | 1 |
| 2017-03-12 13:53:54.920 | 78.89 | 7 | 2 |
| 2017-03-12 13:56:07.130 | 89.31 | 4.5 | 3 |
| 2017-03-12 13:59:55.200 | 16.13 | 10.66 | 4 |
| 2017-03-12 14:03:04.450 | 19.01 | 8 | 5 |
| 2017-03-12 14:04:59.250 | 77.01 | 13.66 | 6 |
| 2017-03-12 14:08:29.757 | 78.89 | 8 | 7 |
I tried GROUP BY and OVER, but every time I run into the problem with the value 77.01: both of its runs end up in a single group.
SELECT TS.[DATE], TS.X, t_index = DENSE_RANK() OVER (ORDER BY TS.X)
FROM TS
Can anyone help me? Thanks.
PS: Apologies for my English.
Answer 0 (score: 3)
You can use a difference of row numbers to identify the groups:
select t.*,
dense_rank() over (order by x, (seqnum - seqnum_x)) as grp
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by x order by date) as seqnum_x
from t
) t;
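The logic here is a bit tricky. To understand it, run the subquery on its own and look at the results. Within a run of consecutive rows that share the same x, seqnum and seqnum_x both increase by 1 per row, so their difference stays constant; once other x values come in between, the difference changes. That is why the pair (x, difference) identifies each group.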
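As an illustration only (this is not part of the original answer), the subquery can be mimicked in a few lines of Python. The sketch assumes the x values from the question are already sorted by date; the function name `row_number_diffs` is made up for this example.

```python
from collections import defaultdict

# x values from the question's sample, already in DATE order.
xs = [77.01, 77.01, 78.89, 78.89, 78.89, 89.31, 89.31,
      16.13, 16.13, 16.13, 19.01, 77.01, 77.01, 77.01, 78.89]

def row_number_diffs(values):
    """Return (x, seqnum, seqnum_x, seqnum - seqnum_x) for every row."""
    seen = defaultdict(int)  # running count per x, like PARTITION BY x
    rows = []
    for seqnum, x in enumerate(values, start=1):
        seen[x] += 1
        rows.append((x, seqnum, seen[x], seqnum - seen[x]))
    return rows

rows = row_number_diffs(xs)
for row in rows:
    print(row)
```

In the printed rows, the first run of 77.01 has a difference of 0 and the second run has a difference of 9, so the pair (x, difference) tells the two runs apart.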
To get the averages:
select t.x, avg(y),
min(date), max(date),
dense_rank() over (order by min(date)) as grp
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by x order by date) as seqnum_x
from t
) t
group by x, (seqnum - seqnum_x)
This numbers the groups in date order, because the numbers are assigned after the aggregation.
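For readers who want to check the expected numbers, here is a rough Python sketch of the same aggregation, not the answer's SQL. It assumes the rows are sorted by date, that no two groups start at the same timestamp, and it keeps only the time of day for brevity; `mean_per_island` is a hypothetical helper name.

```python
from collections import defaultdict

# (time, x, y) rows from the question, sorted by time (date part omitted).
data = [
    ("13:52:38.707", 77.01, 3), ("13:52:55.607", 77.01, 5),
    ("13:53:54.920", 78.89, 7), ("13:54:12.320", 78.89, 8),
    ("13:54:50.287", 78.89, 6), ("13:56:07.130", 89.31, 5),
    ("13:56:44.997", 89.31, 4), ("13:59:55.200", 16.13, 9),
    ("13:59:55.400", 16.13, 10), ("14:00:33.240", 16.13, 13),
    ("14:03:04.450", 19.01, 8), ("14:04:59.250", 77.01, 12),
    ("14:05:37.707", 77.01, 15), ("14:07:30.517", 77.01, 14),
    ("14:08:29.757", 78.89, 8),
]

def mean_per_island(rows):
    """Group by (x, seqnum - seqnum_x) and number the groups by first time."""
    seen = defaultdict(int)  # running count per x, like PARTITION BY x
    islands = {}             # (x, diff) -> [first_time, list of y values]
    for seqnum, (time, x, y) in enumerate(rows, start=1):
        seen[x] += 1
        key = (x, seqnum - seen[x])
        islands.setdefault(key, [time, []])[1].append(y)
    # Number groups by their earliest time, after aggregating.
    ordered = sorted(islands.items(), key=lambda item: item[1][0])
    return [(grp, first, key[0], sum(ys) / len(ys))
            for grp, (key, (first, ys)) in enumerate(ordered, start=1)]

result = mean_per_island(data)
for line in result:
    print(line)
```

The sketch yields seven groups, with the second run of 77.01 numbered 6 rather than being merged into group 1.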
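The first query identifies each group, but it does not number the groups in date order. An alternative that does is to use lag() together with a cumulative sum():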
select t.*,
sum(case when prev_x = x then 0 else 1 end) over (order by date) as grp
from (select t.*,
lag(x) over (order by date) as prev_x
from t
) t;
The logic here is simpler. Just look at the previous value, flag each row where it changes, and sum up the flags.
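The same idea can be sketched in Python to see the flags and the running total side by side. This is an illustration under the assumption that the x values are already in date order; `group_ids` is a made-up name.

```python
from itertools import accumulate

# x values from the question's sample, already in DATE order.
xs = [77.01, 77.01, 78.89, 78.89, 78.89, 89.31, 89.31,
      16.13, 16.13, 16.13, 19.01, 77.01, 77.01, 77.01, 78.89]

def group_ids(values):
    """Assign a group number that increases whenever x changes."""
    prev = [None] + values[:-1]  # previous value, like lag(x)
    flags = [0 if p == x else 1 for p, x in zip(prev, values)]
    return list(accumulate(flags))  # running sum of the change flags

grp = group_ids(xs)
print(grp)
```

The first row has no previous value, so it is flagged and starts group 1, which mirrors how the `case` expression falls through to `else 1` when `prev_x` is NULL.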
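Answer 1 (score: 0)
The solution below uses a lot of window functions, but it gets the result.
The date shown for each group is taken from the row that holds the median y.
It uses the same trick as Gordon Linoff: subtracting two row numbers to get a useful ranking.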
select max([DATE]), X, avg_y as [Mean(y)], row_number() over (order by max([DATE])) as Grp
from (
select *,
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Y) OVER (PARTITION BY X, RN_DIFF) as median_y,
CAST(AVG(Y * 1.0) OVER (PARTITION BY X, RN_DIFF) AS DECIMAL(8,2)) as avg_y
from(
select [DATE], X, Y,
row_number() over (order by [DATE]) - row_number() over (partition by x order by [DATE]) as RN_DIFF
from TS t
) q1
) q2
where median_y = y
group by X, avg_y, RN_DIFF
order by max([DATE]);