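How to get the average of non-contiguous groups

Time: 2017-03-17 09:20:56

Tags: sql sql-server sql-server-2012

I need to filter the data so that each group of (x) yields a single row, with (y) aggregated, for example as mean(y). The catch: a group on (x) must consist of rows that are consecutive in (DATE).

Here is a sample of the data: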

| DATE                    |   x   |  y |
---------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 |  3 |
| 2017-03-12 13:52:55.607 | 77.01 |  5 |
| 2017-03-12 13:53:54.920 | 78.89 |  7 |
| 2017-03-12 13:54:12.320 | 78.89 |  8 |
| 2017-03-12 13:54:50.287 | 78.89 |  6 |
| 2017-03-12 13:56:07.130 | 89.31 |  5 |
| 2017-03-12 13:56:44.997 | 89.31 |  4 |
| 2017-03-12 13:59:55.200 | 16.13 |  9 |
| 2017-03-12 13:59:55.400 | 16.13 | 10 |
| 2017-03-12 14:00:33.240 | 16.13 | 13 |
| 2017-03-12 14:03:04.450 | 19.01 |  8 |
| 2017-03-12 14:04:59.250 | 77.01 | 12 |
| 2017-03-12 14:05:37.707 | 77.01 | 15 |
| 2017-03-12 14:07:30.517 | 77.01 | 14 |
| 2017-03-12 14:08:29.757 | 78.89 |  8 |

What I have so far (note the problem with the value 77.01 in (x)):

| DATE                    |   x   |  y | Grp |
----------------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 |  3 |  1  |
| 2017-03-12 13:52:55.607 | 77.01 |  5 |  1  |
| 2017-03-12 13:53:54.920 | 78.89 |  7 |  2  |
| 2017-03-12 13:54:12.320 | 78.89 |  8 |  2  |
| 2017-03-12 13:54:50.287 | 78.89 |  6 |  2  |
| 2017-03-12 13:56:07.130 | 89.31 |  5 |  3  |
| 2017-03-12 13:56:44.997 | 89.31 |  4 |  3  |
| 2017-03-12 13:59:55.200 | 16.13 |  9 |  4  |
| 2017-03-12 13:59:55.400 | 16.13 | 10 |  4  |
| 2017-03-12 14:00:33.240 | 16.13 | 13 |  4  |
| 2017-03-12 14:03:04.450 | 19.01 |  8 |  5  |
| 2017-03-12 14:04:59.250 | 77.01 | 12 |  1  |-
| 2017-03-12 14:05:37.707 | 77.01 | 15 |  1  |-- must be group 6 not 1
| 2017-03-12 14:07:30.517 | 77.01 | 14 |  1  |-
| 2017-03-12 14:08:29.757 | 78.89 |  8 |  6  |

What I want:

| DATE                    |   x   |  y | Grp |
----------------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 |  3 |  1  |
| 2017-03-12 13:52:55.607 | 77.01 |  5 |  1  |
| 2017-03-12 13:53:54.920 | 78.89 |  7 |  2  |
| 2017-03-12 13:54:12.320 | 78.89 |  8 |  2  |
| 2017-03-12 13:54:50.287 | 78.89 |  6 |  2  |
| 2017-03-12 13:56:07.130 | 89.31 |  5 |  3  |
| 2017-03-12 13:56:44.997 | 89.31 |  4 |  3  |
| 2017-03-12 13:59:55.200 | 16.13 |  9 |  4  |
| 2017-03-12 13:59:55.400 | 16.13 | 10 |  4  |
| 2017-03-12 14:00:33.240 | 16.13 | 13 |  4  |
| 2017-03-12 14:03:04.450 | 19.01 |  8 |  5  |
| 2017-03-12 14:04:59.250 | 77.01 | 12 |  6  |
| 2017-03-12 14:05:37.707 | 77.01 | 15 |  6  |
| 2017-03-12 14:07:30.517 | 77.01 | 14 |  6  |
| 2017-03-12 14:08:29.757 | 78.89 |  8 |  7  |

so that I can get mean(y) per Grp:

| DATE                    |   x   |Mean(y)| Grp |
----------------------------------------------
| 2017-03-12 13:52:38.707 | 77.01 |  4    |  1  |
| 2017-03-12 13:53:54.920 | 78.89 |  7    |  2  |
| 2017-03-12 13:56:07.130 | 89.31 |  4.5  |  3  |
| 2017-03-12 13:59:55.200 | 16.13 | 10.66 |  4  |
| 2017-03-12 14:03:04.450 | 19.01 |  8    |  5  |
| 2017-03-12 14:04:59.250 | 77.01 | 13.66 |  6  |
| 2017-03-12 14:08:29.757 | 78.89 |  8    |  7  |

I tried using GROUP BY and OVER, but every time I run into the problem with the value 77.01: all of its rows are treated as a single group.

SELECT TS.[DATE], TS.X, t_index = DENSE_RANK() OVER (ORDER BY TS.X)
FROM TS

Can anyone help me? Thanks.

PS: Sorry for my English.

2 answers:

Answer 0: (score: 3)

You can use a difference of row numbers to identify the groups:

select t.*,
       dense_rank() over (order by x, (seqnum - seqnum_x)) as grp
from (select t.*,
             row_number() over (order by date) as seqnum,
             row_number() over (partition by x order by date) as seqnum_x
      from t
     ) t;

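The logic here is a bit tricky. To understand it, run the subquery and look at the results: within a run of consecutive rows with the same x, both row numbers increase by one per row, so their difference stays constant, and it changes whenever a different x value interrupts the run.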
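To make the row-number difference concrete, here is a minimal sketch that runs the inner subquery on a subset of the question's rows. It uses Python's sqlite3 module as a stand-in for SQL Server, which is an assumption made only so the example is self-contained (it needs SQLite 3.25 or later for window functions). The `diff` column is added purely for illustration.

```python
import sqlite3

# A subset of the (date, x, y) rows from the question.
rows = [
    ("2017-03-12 13:52:38.707", 77.01, 3),
    ("2017-03-12 13:52:55.607", 77.01, 5),
    ("2017-03-12 13:53:54.920", 78.89, 7),
    ("2017-03-12 13:54:12.320", 78.89, 8),
    ("2017-03-12 13:54:50.287", 78.89, 6),
    ("2017-03-12 14:04:59.250", 77.01, 12),
    ("2017-03-12 14:05:37.707", 77.01, 15),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (date text, x real, y integer)")
con.executemany("insert into t values (?, ?, ?)", rows)

# The answer's inner subquery, plus the difference it relies on.
result = con.execute("""
    select x,
           row_number() over (order by date) as seqnum,
           row_number() over (partition by x order by date) as seqnum_x,
           row_number() over (order by date)
             - row_number() over (partition by x order by date) as diff
    from t
    order by date
""").fetchall()

for x, seqnum, seqnum_x, diff in result:
    print(x, seqnum, seqnum_x, diff)
```

The two runs of 77.01 end up with differences 0 and 3, so grouping by (x, diff) keeps them apart even though x is the same.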

To get the averages:

select t.x, avg(y),
       min(date), max(date),
       dense_rank() over (order by min(date)) as grp
from (select t.*,
             row_number() over (order by date) as seqnum,
             row_number() over (partition by x order by date) as seqnum_x
      from t
     ) t
group by x, (seqnum - seqnum_x)

This produces the group numbers in date order (because they are assigned after the aggregation).

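The queries above identify each group, but the first one does not number the groups in date order. An alternative that does is to use lag() and a cumulative sum():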

select t.*,
       sum(case when prev_x = x then 0 else 1 end) over (order by date) as grp
from (select t.*,
             lag(x) over (order by date) as prev_x
      from t
     ) t;

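The logic here is simpler. Just look at the previous value and set a flag whenever it changes; the running sum of the flags is the group number.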
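As a rough end-to-end check, the sketch below applies the lag() and running-sum idea to the question's sample rows and then averages y per group. It again uses Python's sqlite3 module in place of SQL Server (an assumption, requiring SQLite 3.25 or later); the aliases `lagged`, `numbered`, `first_date` and `mean_y` are hypothetical names chosen for this example.

```python
import sqlite3

# The sample rows from the question: (date, x, y).
rows = [
    ("2017-03-12 13:52:38.707", 77.01, 3),
    ("2017-03-12 13:52:55.607", 77.01, 5),
    ("2017-03-12 13:53:54.920", 78.89, 7),
    ("2017-03-12 13:54:12.320", 78.89, 8),
    ("2017-03-12 13:54:50.287", 78.89, 6),
    ("2017-03-12 13:56:07.130", 89.31, 5),
    ("2017-03-12 13:56:44.997", 89.31, 4),
    ("2017-03-12 13:59:55.200", 16.13, 9),
    ("2017-03-12 13:59:55.400", 16.13, 10),
    ("2017-03-12 14:00:33.240", 16.13, 13),
    ("2017-03-12 14:03:04.450", 19.01, 8),
    ("2017-03-12 14:04:59.250", 77.01, 12),
    ("2017-03-12 14:05:37.707", 77.01, 15),
    ("2017-03-12 14:07:30.517", 77.01, 14),
    ("2017-03-12 14:08:29.757", 78.89, 8),
]

con = sqlite3.connect(":memory:")
con.execute("create table t (date text, x real, y integer)")
con.executemany("insert into t values (?, ?, ?)", rows)

# Flag each change of x, turn the running sum of flags into a group
# number, then aggregate per group.
result = con.execute("""
    select grp, x, min(date) as first_date, avg(y) as mean_y
    from (select lagged.*,
                 sum(case when prev_x = x then 0 else 1 end)
                     over (order by date) as grp
          from (select t.*,
                       lag(x) over (order by date) as prev_x
                from t
               ) lagged
         ) numbered
    group by grp, x
    order by grp
""").fetchall()

for grp, x, first_date, mean_y in result:
    print(grp, x, first_date, round(mean_y, 2))
```

On this data the second run of 77.01 comes out as group 6, as the question asks. One caveat when moving back to SQL Server: avg() over an integer column returns an integer there, so if y is an integer something like avg(y * 1.0) would be needed to keep the fractional part.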

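Answer 1: (score: 0)

The solution below uses a lot of window functions, but it gets the result.

The date value is based on the median.

It uses the same trick as Gordon Linoff: subtracting two row numbers to get a useful ranking.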

select max([DATE]), X, avg_y as [Mean(y)], row_number() over (order by max([DATE])) as Grp
from (
    select *,
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Y) OVER (PARTITION BY X, RN_DIFF) as median_y,
    CAST(AVG(Y * 1.0) OVER (PARTITION BY X, RN_DIFF) AS DECIMAL(8,2)) as avg_y
    from(
        select [DATE], X, Y, 
        row_number() over (order by [DATE]) - row_number() over (partition by x order by [DATE]) as RN_DIFF
        from TS t
        ) q1
) q2
where median_y = y
group by X, avg_y, RN_DIFF
order by max([DATE]);