How to get the first and last record in Hive SQL when a key field differs

Time: 2016-12-07 16:50:21

Tags: sql hive

Using a Hive table, I need to get the first and last record for each user when one of the key fields changes over time.

Here is some sample data:

UserID  EntryDate   Activity
a3324   1/1/16  walk
a3324   1/2/16  walk
a3324   1/3/16  walk
a3324   1/4/16  run
a5613   1/1/16  walk
a5613   1/2/16  walk
a5613   1/3/16  walk
a5613   1/4/16  walk

I'm looking for output that ideally looks like this:

a3324   1/1/16  walk    1/4/16  run

or at least like this:

a3324   walk    run

I started writing code like this:

SELECT UserID, MINIMUM(EntryDate), MAXIMUM(EntryDate), Activity
FROM
     SELECT UserID, DISTINCT Activity
     GROUP BY UserID
     HAVING Count(Activity) > 1

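but I know that isn't right.

I would also like to be able to specify, possibly in the WHERE clause, the case where the original activity is walk and the second activity is run.

Can you help with this?

Thanks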

2 answers:

Answer 0: (score: 0)

SELECT
    t.UserId
    ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.EntryDate END) as MinEntryDate
    ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.Activity END) as MinActivity
    ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.EntryDate END) as MaxEntryDate
    ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.Activity END) as MaxActivity
FROM
    (
       SELECT
          UserId
          ,EntryDate
          ,Activity
          ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate) as RowNumAsc
          ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate DESC) as RowNumDesc
       FROM
          Table
    ) t
WHERE
    t.RowNumAsc = 1
    OR t.RowNumDesc = 1
GROUP BY
    t.UserId

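Hive supports window functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics), so you can compute two row numbers per user, one ordered by EntryDate ascending and the other descending, and then use conditional aggregation to pick out the first and last rows.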
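
The question also asks how to keep only users whose first activity is walk and whose last activity is run. A minimal sketch, assuming the window-function query above is wrapped in a subquery (the alias `r` is illustrative, `Table` is the same placeholder table name as above, and EntryDate is assumed to be stored in a type that sorts chronologically), filters on the aggregated columns in an outer WHERE clause:

```sql
-- Sketch only: `Table` is a placeholder name and `r` is an illustrative alias.
-- Assumes EntryDate sorts chronologically (e.g. a DATE column, not an M/D/YY string).
SELECT
    r.UserId
    ,r.MinEntryDate
    ,r.MinActivity
    ,r.MaxEntryDate
    ,r.MaxActivity
FROM
    (
       SELECT
          t.UserId
          ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.EntryDate END) as MinEntryDate
          ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.Activity END) as MinActivity
          ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.EntryDate END) as MaxEntryDate
          ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.Activity END) as MaxActivity
       FROM
          (
             SELECT
                UserId
                ,EntryDate
                ,Activity
                ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate) as RowNumAsc
                ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate DESC) as RowNumDesc
             FROM
                Table
          ) t
       WHERE
          t.RowNumAsc = 1
          OR t.RowNumDesc = 1
       GROUP BY
          t.UserId
    ) r
WHERE
    -- keep only users who started with walk and ended with run
    r.MinActivity = 'walk'
    AND r.MaxActivity = 'run'
```

With the sample data from the question, this would keep a3324 (walk, then run) and drop a5613 (walk on both the first and last date).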

If you don't want to use analytic (window) functions, you can use a self left join with conditional aggregation:

SELECT
    t.UserId
    ,MIN(CASE WHEN mn.UserId IS NULL THEN t.EntryDate END) as MinEntryDate
    ,MIN(CASE WHEN mn.UserId IS NULL THEN t.Activity END) as MinActivity
    ,MAX(CASE WHEN mx.UserId IS NULL THEN t.EntryDate END) as MaxEntryDate
    ,MAX(CASE WHEN mx.UserId IS NULL THEN t.Activity END) as MaxActivity
FROM
    Table t
    LEFT JOIN Table mn
    ON t.UserId = mn.UserId
    AND t.EntryDate > mn.EntryDate
    LEFT JOIN Table mx
    ON t.UserId = mx.UserId
    AND t.EntryDate < mx.EntryDate
WHERE
    mn.UserId IS NULL
    OR mx.UserId IS NULL
GROUP BY
    t.UserId

Or a correlated subquery approach:

SELECT
    UserId
    ,MIN(EntryDate) as MinEntryDate
    ,(SELECT
          Activity
       FROM
          Activity a
       WHERE
          u.UserId = a.UserId
          AND a.EntryDate = MIN(u.EntryDate)
       LIMIT 1
    ) as MinActivity
    ,MAX(EntryDate) as MaxEntryDate
    ,(SELECT
          Activity
       FROM
          Activity a
       WHERE
          u.UserId = a.UserId
          AND a.EntryDate = MAX(u.EntryDate)
       LIMIT 1
    ) as MaxActivity
FROM
    Activity u
GROUP BY
    UserId

Answer 1: (score: 0)

You can use lag/lead to get the solution:

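 SELECT * FROM (
    SELECT UserID, EntryDate, Activity,
    -- Activity of the same user's next row by date (NULL on the user's last row)
    LEAD(Activity, 1) OVER (PARTITION BY UserID ORDER BY EntryDate) AS nextActivity
    FROM table) A
 -- keep only the rows where the activity changes on the following row
 WHERE Activity <> nextActivity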
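
Note that the LEAD approach reports each point where a user's activity changes between consecutive rows, rather than strictly the first and last rows. A minimal sketch, assuming the column is named Activity as in the sample data, that `Table` is a placeholder table name, and that EntryDate sorts chronologically, which also carries the date of the following row and restricts the result to a walk-to-run change:

```sql
-- Sketch only: `Table` is a placeholder name; nextEntryDate/nextActivity are illustrative aliases.
SELECT
    UserID
    ,EntryDate
    ,Activity
    ,nextEntryDate
    ,nextActivity
FROM
    (
       SELECT
          UserID
          ,EntryDate
          ,Activity
          -- date and activity of the same user's next row, ordered by EntryDate
          ,LEAD(EntryDate, 1) OVER (PARTITION BY UserID ORDER BY EntryDate) as nextEntryDate
          ,LEAD(Activity, 1) OVER (PARTITION BY UserID ORDER BY EntryDate) as nextActivity
       FROM
          Table
    ) A
WHERE
    Activity = 'walk'
    AND nextActivity = 'run'
```

For the sample data this would return the a3324 row dated 1/3/16 (walk) together with the following row's values, 1/4/16 and run. A user who switches activities more than once would produce one row per matching change.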