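Maximum time a user spent in a month, for each month of the year, using time periods

Time: 2017-08-12 19:26:28

Tags: sql hive hiveql

I am trying to find the user who spent the most time in a month, for each month of the year.

I am using the following data: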

   uid             activity-time                status
   ...          ...................            ........
    1           2016-12-31 16:00:04            sign in
    1           2016-12-31 21:05:37            sign out
    2           2016-12-25 18:00:04            sign in
    2           2016-12-25 20:45:31            sign out 
    7           2016-10-31 13:00:04            sign in
    7           2016-10-31 16:05:30            sign out
    1           2016-12-27 17:00:04            sign in
    1           2016-12-27 19:05:00            sign out
    2           2016-10-25 18:00:04            sign in
    2           2016-10-25 20:45:31            sign out
    4           2017-12-31 16:00:04            sign in
    4           2017-12-31 21:05:37            sign out
    3           2017-12-25 18:00:04            sign in
    3           2017-12-25 20:45:31            sign out 
    7           2017-10-31 16:00:04            sign in
    7           2017-10-31 21:05:37            sign out
    3           2017-10-25 18:00:04            sign in
    3           2017-10-25 20:45:31            sign out 

I expect the following output:

uid        year  month      time-spent
......     ..... .....      ..........
1          2016   12        07:10:29
7          2016   10        03:05:26
4          2017   12        05:05:33
7          2017   10        05:05:33

I tried the following query, but I don't know how to specify the condition for sign in and sign out:

SELECT ETS.*
FROM (SELECT year(activity-time),month(activity-time), uid, count(uid) as c,
ROW_NUMBER() OVER (PARTITION BY month(activity-time) ORDER BY COUNT(uid) DESC) as seq
FROM activity_table
GROUP BY month(activity-time),year(activity-time), uid
) ds
WHERE seq = 1
ORDER BY c DESC ;

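2 answers:

Answer 0: (score: 0)

You can use a nested query with lag to get the time difference between each sign-in record and the sign-out record that follows it.

I don't have HiveQL at hand, so I may be off on some of the specific date/time functions, but the idea is: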

select yr,
       mnth,
       uid,
       from_unixtime(spent, 'HH:mm:ss') spent  -- 'HH' is the 24-hour field; 'hh' would use the 12-hour clock
from (
        select year(activity_time) yr, 
               month(activity_time) mnth,
               uid, 
               sum(spent) spent,
               row_number() over (partition by year(activity_time), month(activity_time)
                                  order by     sum(spent) desc) rn
        from (
                select uid,
                       activity_time,
                       status,
                       unix_timestamp(activity_time) 
                           - lag(unix_timestamp(activity_time)) 
                                 over (partition by uid order by activity_time) spent
                from   activity_table
             ) base
        where status = 'sign out'
        group by year(activity_time), 
                 month(activity_time),
                 uid
      ) grouped
where rn = 1;

Note: I recommend against using hyphens in column names; use underscores instead (as I did in the SQL above).
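
To sanity-check the lag idea without a Hive session, here is a minimal Python sketch of the same three steps: take the gap to the previous record per user, sum the gaps on sign-out rows per year/month/user, then keep the top user per month. The names `ROWS`, `format_hms` and `top_user_per_month` are made up for illustration, and the sketch assumes every sign out directly follows the matching sign in for that user. The totals are computed directly from the sample rows in the question. Unlike a clock-style time format, `format_hms` does not wrap around after 24 hours, which may matter for monthly totals.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (uid, activity_time, status)
ROWS = [
    (1, "2016-12-31 16:00:04", "sign in"),
    (1, "2016-12-31 21:05:37", "sign out"),
    (2, "2016-12-25 18:00:04", "sign in"),
    (2, "2016-12-25 20:45:31", "sign out"),
    (7, "2016-10-31 13:00:04", "sign in"),
    (7, "2016-10-31 16:05:30", "sign out"),
    (1, "2016-12-27 17:00:04", "sign in"),
    (1, "2016-12-27 19:05:00", "sign out"),
    (2, "2016-10-25 18:00:04", "sign in"),
    (2, "2016-10-25 20:45:31", "sign out"),
    (4, "2017-12-31 16:00:04", "sign in"),
    (4, "2017-12-31 21:05:37", "sign out"),
    (3, "2017-12-25 18:00:04", "sign in"),
    (3, "2017-12-25 20:45:31", "sign out"),
    (7, "2017-10-31 16:00:04", "sign in"),
    (7, "2017-10-31 21:05:37", "sign out"),
    (3, "2017-10-25 18:00:04", "sign in"),
    (3, "2017-10-25 20:45:31", "sign out"),
]


def format_hms(seconds):
    """Format a duration in seconds as HH:MM:SS; hours may exceed 23."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def top_user_per_month(rows):
    """Return {(year, month): (uid, 'HH:MM:SS')} for the user with the largest total."""
    # Sorting by (uid, time) plays the role of "partition by uid order by activity_time".
    parsed = sorted(
        (uid, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), status)
        for uid, ts, status in rows
    )
    totals = defaultdict(int)  # (year, month, uid) -> seconds spent
    previous = {}              # uid -> previous activity time (the "lag" value)
    for uid, ts, status in parsed:
        # Only sign-out rows contribute, like "where status = 'sign out'".
        if status == "sign out" and uid in previous:
            totals[(ts.year, ts.month, uid)] += int((ts - previous[uid]).total_seconds())
        previous[uid] = ts

    best = {}  # (year, month) -> (uid, seconds), like "rn = 1"
    for (year, month, uid), spent in totals.items():
        if (year, month) not in best or spent > best[(year, month)][1]:
            best[(year, month)] = (uid, spent)
    return {ym: (uid, format_hms(spent)) for ym, (uid, spent) in sorted(best.items())}


for (year, month), (uid, spent) in top_user_per_month(ROWS).items():
    print(uid, year, month, spent)
# 7 2016 10 03:05:26
# 1 2016 12 07:10:29
# 7 2017 10 05:05:33
# 4 2017 12 05:05:33
```

Like the SQL, the sketch attributes a session to the month of its sign-out row, so a session that spans a month boundary is counted entirely in the later month.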

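Answer 1: (score: 0)

This is in SQL Server, but it should give you the idea. I first created a CTE that computes the total number of seconds from the time of day, so that I can use SUM, grouped by ID and MM-yyyy date, and then convert the result back to a time format. Then I use row_number to get the maximum for each date.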

;WITH activity_table_seconds 
     AS (SELECT [uid], 
                [activity-time], 
                ( Datepart(hour, [activity-time]) * 60 * 60 ) + ( 
                Datepart(minute, [activity-time]) * 60 ) + 
                Datepart(second, [activity-time]) AS 
                [activity-time-seconds], 
                [status] 
         FROM   @activity_table) 
SELECT [uid], 
       [date], 
       [activity-time] 
FROM   (SELECT *, 
               Row_number () 
                 OVER ( 
                   partition BY [date] 
                   ORDER BY [activity-time] DESC) rn 
        FROM   (SELECT a.[uid], 
                       Format(a.[activity-time], 'MM-yyyy') AS [date], 
                       CONVERT(VARCHAR(8), 
                       Dateadd(second, Sum(b.[activity-time-seconds] - 
                                           a.[activity-time-seconds]), 0), 
                               108) AS [activity-time] 
                FROM   (SELECT * 
                        FROM   activity_table_seconds 
                        WHERE  [status] = 'sign in') a 
                       INNER JOIN (SELECT * 
                                   FROM   activity_table_seconds 
                                   WHERE  [status] = 'sign out') b 
                               ON a.[uid] = b.[uid] 
                                  AND Cast(a.[activity-time] AS DATE) = Cast( 
                                      b.[activity-time] AS DATE) 
                GROUP  BY a.[uid], 
                          Format(a.[activity-time], 'MM-yyyy')) a) b 
WHERE  b.rn = 1
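
The join above pairs sign-in and sign-out rows by uid and calendar date and subtracts seconds-of-day values. A small Python sketch of that pairing, as I read it, shows where it matches real elapsed time and where it may not. The helper names `seconds_of_day` and `joined_seconds` and the two extra test inputs are hypothetical, not part of the question's data.

```python
from datetime import datetime


def seconds_of_day(moment):
    """Seconds since midnight, like the DATEPART arithmetic in the CTE."""
    return moment.hour * 3600 + moment.minute * 60 + moment.second


def joined_seconds(rows):
    """Sum (sign-out - sign-in) seconds-of-day over every pair that shares
    the same uid and the same calendar date, mirroring the inner join."""
    parsed = [
        (uid, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), status)
        for uid, ts, status in rows
    ]
    sign_ins = [(uid, t) for uid, t, status in parsed if status == "sign in"]
    sign_outs = [(uid, t) for uid, t, status in parsed if status == "sign out"]
    total = 0
    for uid_in, t_in in sign_ins:
        for uid_out, t_out in sign_outs:
            if uid_in == uid_out and t_in.date() == t_out.date():
                total += seconds_of_day(t_out) - seconds_of_day(t_in)
    return total


# One session on one day (a row pair from the question): matches elapsed time.
one_session = [
    (1, "2016-12-31 16:00:04", "sign in"),
    (1, "2016-12-31 21:05:37", "sign out"),
]
# Hypothetical: two one-hour sessions on the same day get cross-joined.
two_sessions = [
    (1, "2016-12-31 09:00:00", "sign in"),
    (1, "2016-12-31 10:00:00", "sign out"),
    (1, "2016-12-31 14:00:00", "sign in"),
    (1, "2016-12-31 15:00:00", "sign out"),
]
# Hypothetical: a 45-minute session that crosses midnight finds no same-day pair.
over_midnight = [
    (1, "2016-12-31 23:30:00", "sign in"),
    (1, "2017-01-01 00:15:00", "sign out"),
]

print(joined_seconds(one_session))    # 18333 (05:05:33)
print(joined_seconds(two_sessions))   # 14400, although only 7200 seconds elapsed
print(joined_seconds(over_midnight))  # 0, although 2700 seconds elapsed
```

On the question's sample data every user has at most one session per day and no session crosses midnight, so the approach gives the same totals as pairing each sign out with the preceding sign in. With several sessions per day or overnight sessions, pairing consecutive rows is likely the safer choice.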