Question

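We have a time series in a Spark SQL table that records the time of each event a user performs.

However, users tend to perform many events in a burst. I want to find the minimum time in each of these bursts for each user.

Unfortunately this is historical data, so I cannot change how the table was created. So I basically want select min(time_), user from my_table group by user, but per burst. Any help would be greatly appreciated!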

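Edit:

Some sample data would be:

user    time_
0       10
0       11
2       12
0       12
2       13
2       15
0       83
0       84
0       85

So, for example, in the data above I would want to find (0, 10), (2, 12) and (0, 83). We can say that events belong to the same burst if they occur within 1 hour of each other (which would be 60 in the sample data above).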

Answer 1

If this is the only information you need:

select      user
           ,time_

from       (select      user
                       ,time_
                       ,case when time_ - lag (time_,1,time_-60) over (partition by user order by time_) >= 60 then 'Y' else null end  as burst

            from        my_table 
            ) t

where       burst = 'Y'
;

user    time_
0       10
0       83
2       12
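
As a cross-check on the gap rule, here is a minimal Python sketch (not part of the original answer) that applies the same logic as the query above to the sample data from the question. The function name burst_starts and the gap parameter are hypothetical. It assumes a burst starts at a user's first event, or whenever the gap from that user's previous event is at least 60.

```python
from collections import defaultdict

# Sample rows (user, time_) copied from the question.
rows = [(0, 10), (0, 11), (2, 12), (0, 12), (2, 13),
        (2, 15), (0, 83), (0, 84), (0, 85)]

def burst_starts(rows, gap=60):
    """Return (user, time_) for the first event of each burst (hypothetical helper)."""
    by_user = defaultdict(list)
    for user, t in rows:
        by_user[user].append(t)

    starts = []
    for user in sorted(by_user):
        prev = None  # plays the role of lag(time_) within the user's partition
        for t in sorted(by_user[user]):
            # First event of the user, or a gap of at least `gap`: new burst.
            if prev is None or t - prev >= gap:
                starts.append((user, t))
            prev = t
    return starts

print(burst_starts(rows))
```

On the sample data this yields the same three rows as the query: (0, 10), (0, 83) and (2, 12).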

If you need to gather additional information about each burst:

select      user
           ,burst_seq

           ,min (time_) as min_time_
           ,max (time_) as max_time_
           ,count (*)   as events_num

from       (select      user
                       ,time_

                       ,count(burst) over 
                        (
                            partition by    user 
                            order by        time_  
                            rows unbounded preceding
                        ) + 1                           as burst_seq

            from       (select      user
                                   ,time_
                                   ,case when time_ - lag (time_) over (partition by user order by time_) >= 60 then 'Y' else null end as burst

                        from        my_table 
                        ) t
            ) t

group by    user
           ,burst_seq
;

user    burst_seq   min_time_   max_time_   events_num
0       1           10          12          3
0       2           83          85          3
2       1           12          15          3
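
The per-burst aggregation can be sanity-checked the same way. The following Python sketch (again not from the original answer; burst_summary and gap are hypothetical names) numbers each user's bursts by counting gaps of at least 60, mirroring the running count(burst) + 1 in the query, and then reports the minimum time, maximum time and event count per burst.

```python
from collections import defaultdict

# Sample rows (user, time_) copied from the question.
rows = [(0, 10), (0, 11), (2, 12), (0, 12), (2, 13),
        (2, 15), (0, 83), (0, 84), (0, 85)]

def burst_summary(rows, gap=60):
    """Return (user, burst_seq, min_time_, max_time_, events_num) per burst."""
    by_user = defaultdict(list)
    for user, t in rows:
        by_user[user].append(t)

    out = []
    for user in sorted(by_user):
        times = sorted(by_user[user])
        seq, first, prev, n = 1, times[0], times[0], 0
        for t in times:
            if t - prev >= gap:
                # Gap reached: close the current burst and open the next one.
                out.append((user, seq, first, prev, n))
                seq, first, n = seq + 1, t, 0
            prev = t
            n += 1
        out.append((user, seq, first, prev, n))  # close the user's last burst
    return out

for row in burst_summary(rows):
    print(row)
```

For the sample data this gives (0, 1, 10, 12, 3), (0, 2, 83, 85, 3) and (2, 1, 12, 15, 3), matching the table above.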

P.S. There seems to be a bug with the CASE statement: case when ... then 'Y' end yields FAILED: IndexOutOfBoundsException Index: 2, Size: 2, although it is legal syntax.
Adding else null solved it.
