How to find peaks and valleys in sequential runs in SQL

Time: 2018-10-11 13:15:04

Tags: sql amazon-athena presto

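I have a dataset in Athena, so for the purposes of this question you can treat it as a Postgres database. A sample of the data can be seen in this SQL Fiddle.

What I would like to get is a dataset with all the values, but with an extra is_peak column highlighting the highest val within each contiguous run of 'p' and the lowest val within each contiguous run of 'v'.

Here's a sample: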
  create table vals (
  timestamp int,
  type varchar(25),
  val int
  );

  insert into vals(timestamp,type, val) 
  values      (10, null, 1),
              (20, null, 2),
              (39, null, 1),
              (40,'p',1),
              (50,'p',2),
              (60,'p',1),
              (70,'v',5),
              (80,'v',6),
              (90,'v',6),
              (100,'v',3),
              (110,null,3),
              (120,'v',6),
              (130,null,3),
              (140,'p',10),
              (150,'p',8),
              (160,null,3),
              (170,'p',1),
              (180,'p',2),
              (190,'p',2),
              (200,'p',1),
              (210,null,3),
              (220,'v',1),
              (230,'v',1),
              (240,'v',3),
              (250,'v',41);

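There are a lot of options for the type of is_peak; it would be fine if it were some sort of dense rank or an increasing number. I just want to be sure that, within each contiguous set, the "marked" row is the one with the highest or lowest val.

Good luck, and thanks for the assistance.

Note: the max for a peak or the min for a valley can be anywhere in the contiguous set, but once the type changes, we start over.

3 Answers:

Answer 0: (score: 3)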

You could use the LEAD/LAG window functions:

SELECT *,
  CASE WHEN type = 'p' AND val > LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
        AND val > LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
       WHEN type = 'v' AND val < LAG(val) OVER(PARTITION BY type ORDER BY timestamp)
        AND val < LEAD(val) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
  END AS is_peak
FROM vals
ORDER BY timestamp;

db<>fiddle demo

Output:

┌───────────┬───────┬──────┬─────────┐
│ timestamp │ type  │ val  │ is_peak │
├───────────┼───────┼──────┼─────────┤
│       10  │       │   1  │         │
│       20  │       │   2  │         │
│       39  │       │   1  │         │
│       40  │ p     │   1  │         │
│       50  │ p     │   2  │       1 │
│       60  │ p     │   1  │         │
│       70  │ v     │   5  │         │
│       80  │ v     │   6  │         │
│       90  │ v     │   6  │         │
│      100  │ v     │   3  │       1 │
│      110  │       │   3  │         │
│      120  │ v     │   6  │         │
│      130  │       │   3  │         │
│      140  │ p     │  10  │       1 │
│      150  │ p     │   8  │         │
└───────────┴───────┴──────┴─────────┘

Version with a WINDOW clause:

SELECT *, CASE WHEN type = 'p' AND val > LAG(val) OVER s
                AND val > LEAD(val) OVER s THEN 1
               WHEN type = 'v' AND val < LAG(val) OVER s
                AND val < LEAD(val) OVER s THEN 1
          END AS is_peak
FROM vals
WINDOW s AS (PARTITION BY type ORDER BY timestamp)
ORDER BY timestamp;

db<>fiddle demo2

EDIT

"I think with just a very small change we could get timestamp 120 as well, and then that would be it."

SELECT *,
  CASE WHEN type IN ('p','v') AND val > LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
        AND val > LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
       WHEN type IN ('v') AND val < LAG(val,1,0) OVER(PARTITION BY type ORDER BY timestamp)
        AND val < LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
  END AS is_peak
FROM vals
ORDER BY timestamp;

db<>fiddle demo3

EDIT 2:

Final solution with gaps-and-islands detection (handling plateaus):

WITH cte AS (
  SELECT *, LEAD(val,1,0) OVER(PARTITION BY type ORDER BY timestamp) AS l
  FROM vals
), cte2 AS (
  SELECT *, SUM(CASE WHEN val = l THEN 1 ELSE 0 END) OVER(PARTITION BY type ORDER BY timestamp) AS dr
  FROM cte
), cte3 AS (
  SELECT *, CASE WHEN type IN ('p') AND val > LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
                AND val >= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
               WHEN type IN ('v') AND val < LAG(val,1) OVER(PARTITION BY type ORDER BY timestamp)
                AND val <= LEAD(val,1) OVER(PARTITION BY type ORDER BY timestamp) THEN 1
          END AS is_peak
  FROM cte2
)
SELECT timestamp, type, val,
     CASE WHEN is_peak = 1 THEN 1
          WHEN EXISTS (SELECT 1 FROM cte3 cx
                       WHERE cx.is_peak = 1
                         AND cx.val = cte3.val
                         AND cx.type = cte3.type
                         AND cx.dr = cte3.dr)
              THEN 1
     END is_peak
FROM cte3
ORDER BY timestamp;

db<>fiddle demo final

Output:

┌───────────┬───────┬──────┬─────────┐
│ timestamp │ type  │ val  │ is_peak │
├───────────┼───────┼──────┼─────────┤
│       10  │       │   1  │         │
│       20  │       │   2  │         │
│       39  │       │   1  │         │
│       40  │ p     │   1  │         │
│       50  │ p     │   2  │       1 │
│       60  │ p     │   1  │         │
│       70  │ v     │   5  │         │
│       80  │ v     │   6  │         │
│       90  │ v     │   6  │         │
│      100  │ v     │   3  │       1 │
│      110  │       │   3  │         │
│      120  │ v     │   6  │         │
│      130  │       │   3  │         │
│      140  │ p     │  10  │       1 │
│      150  │ p     │   8  │         │
│      160  │       │   3  │         │
│      170  │ p     │   1  │         │
│      180  │ p     │   2  │       1 │
│      190  │ p     │   2  │       1 │
│      200  │ p     │   1  │         │
│      210  │       │   3  │         │
│      220  │ v     │   1  │       1 │
│      230  │ v     │   1  │       1 │
│      240  │ v     │   3  │         │
│      250  │ v     │  41  │         │
└───────────┴───────┴──────┴─────────┘

Extra note:

ISO SQL:2016 added pattern matching, MATCH_RECOGNIZE, for this kind of scenario: you define a regular expression that describes a peak. At the time of writing, however, it is supported only by Oracle.

Related article: Modern SQL - match_recognize Regular Expressions Over Rows
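A minimal sketch, not part of the original answer, for inspecting what the first LEAD/LAG query in this answer actually compares. It assumes the vals table from the question; the column aliases prev_val and next_val are made up here for illustration.

```sql
-- Show each row next to the previous and next val of the same type.
-- PARTITION BY type strings together every row of a given type,
-- whether or not those rows are contiguous in timestamp order.
SELECT timestamp, type, val,
       LAG(val)  OVER (PARTITION BY type ORDER BY timestamp) AS prev_val,
       LEAD(val) OVER (PARTITION BY type ORDER BY timestamp) AS next_val
FROM vals
ORDER BY timestamp;
```

With the sample data from the question, the row at timestamp 120 gets prev_val = 3 (from timestamp 100) and next_val = 1 (from timestamp 220), so the neighbours it is compared against belong to other runs of 'v'.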

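Answer 1: (score: 3)

There is a trick for solving "gaps and islands" problems like this one.

If you subtract a row_number partitioned by some value from a row_number over the whole set, you get a kind of ranking that stays constant within each contiguous run of that value.

For some purposes this method has a few drawbacks.
But it works for this case.

Once the ranking is calculated, other window functions in the outer query can use it.
Here we use row_number again, but depending on the requirements you could use DENSE_RANK or the MIN & MAX window functions instead.

Then we just wrap them in a CASE to pick different logic depending on the type.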

select timestamp, type, val, 
(case 
 when type = 'v' and row_number() over (partition by (rn1-rn2), type order by val, rn1) = 1 then 1
 when type = 'p' and row_number() over (partition by (rn1-rn2), type order by val desc, rn1) = 1 then 1
 end) is_peak
-- , rn1, rn2, (rn1-rn2) as rnk
from
(
  select timestamp, type, val,
   row_number() over (order by timestamp) as rn1,
   row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp;

You can test it on SQL Fiddle here

Returns:

timestamp   type    val     is_peak
---------   ----    ----    -------
10          null    1       null
20          null    2       null
39          null    1       null
40          p       1       null
50          p       2       1
60          p       1       null
70          v       5       null
80          v       6       null
90          v       6       null
100         v       3       1
110         null    3       null
120         v       6       1
130         null    3       null
140         p       10      1
150         p       8       null
160         null    3       null
170         p       1       null
180         p       2       1
190         p       2       null
200         p       1       null
210         null    3       null
220         v       1       1
230         v       1       null
240         v       3       null
250         v       41      null
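
The MIN & MAX alternative mentioned in this answer could look roughly like the following. This is a sketch under the same assumptions as the query above (the vals table from the question, Postgres-style syntax), not the answerer's own code; the island_id column is a name introduced here only to expose the rn1 - rn2 grouping key.

```sql
-- A run ("island") is identified by the pair (type, rn1 - rn2);
-- rn1 - rn2 alone can repeat across different types.
select timestamp, type, val,
  (rn1 - rn2) as island_id,
  case
    -- highest val within a contiguous run of 'p'
    when type = 'p' and val = max(val) over (partition by type, (rn1 - rn2)) then 1
    -- lowest val within a contiguous run of 'v'
    when type = 'v' and val = min(val) over (partition by type, (rn1 - rn2)) then 1
  end as is_peak
from
(
  select timestamp, type, val,
   row_number() over (order by timestamp) as rn1,
   row_number() over (partition by type order by timestamp) as rn2
  from vals
) q
order by timestamp;
```

Unlike the row_number version, this marks every row that ties for the extreme within its run, so with the sample data timestamps 180 and 190 would both get is_peak = 1, as would 220 and 230.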

Answer 2: (score: 1)

You can achieve this with subqueries in a CASE expression:

create table #vals 
(
    [timestamp] int,
    [type] varchar(25),
    val int
);

insert into #vals ([timestamp], [type], val) 
values  (10, null, 1),
        (20, null, 2),
        (30, null, 1),
        (40,'p',1),
        (50,'p',2),
        (60,'p',1),
        (70,'v',5),
        (80,'v',6),
        (90,'v',6),
        (100,'v',3),
        (110,null,3)

select 
    r.*,
    case 
        when r.[type] = 'p' and not exists (select * from #vals c where c.[type] = r.[type] and c.val > r.val) then 1
        when r.[type] = 'v' and not exists (select * from #vals c where c.[type] = r.[type] and c.val < r.val) then 1
        else null
    end as is_peak
from #vals r

drop table #vals

Result:

/----------------------------------\
| timestamp | type | val | is_peak |
|-----------|------|-----|---------|
| 10        | NULL | 1   | NULL    |
| 20        | NULL | 2   | NULL    |
| 30        | NULL | 1   | NULL    |
| 40        | p    | 1   | NULL    |
| 50        | p    | 2   | 1       |
| 60        | p    | 1   | NULL    |
| 70        | v    | 5   | NULL    |
| 80        | v    | 6   | NULL    |
| 90        | v    | 6   | NULL    |
| 100       | v    | 3   | 1       |
| 110       | NULL | 3   | NULL    |
\----------------------------------/

Note: if multiple records share the same (peak) val, each of them will be marked with 1 in the is_peak column.