Using a Hive table, I need to get a user's first and last record when one of the key fields changes over time.
Here is some sample data:
UserID EntryDate Activity
a3324 1/1/16 walk
a3324 1/2/16 walk
a3324 1/3/16 walk
a3324 1/4/16 run
a5613 1/1/16 walk
a5613 1/2/16 walk
a5613 1/3/16 walk
a5613 1/4/16 walk
The output I'm looking for is, ideally, this:
a3324 1/1/16 walk 1/4/16 run
or at least this:
a3324 walk run
I started writing code like this:
SELECT UserID, MINIMUM(EntryDate), MAXIMUM(EntryDate), Activity
FROM
SELECT UserID, DISTINCT Activity
GROUP BY UserID
HAVING Count(Activity) > 1
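But I know that's not right.
I would also like to be able to specify, in the WHERE clause, cases where the original activity is walk and the second activity is, for example, run.
Can you help with this?
Thanks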
Answer 0: (score: 0)
SELECT
t.UserId
,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.EntryDate END) as MinEntryDate
,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.Activity END) as MinActivity
,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.EntryDate END) as MaxEntryDate
,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.Activity END) as MaxActivity
FROM
(
SELECT
UserId
,EntryDate
,Activity
,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate) as RowNumAsc
,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate DESC) as RowNumDesc
FROM
Table
) t
WHERE
t.RowNumAsc = 1
OR t.RowNumDesc = 1
GROUP BY
t.UserId
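Hive supports window functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics), so two row numbers, one ordered by EntryDate ascending and the other by EntryDate descending, combined with conditional aggregation will get you the answer.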
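The question also asks for a way to keep only users whose first activity is walk and whose last activity is run. One way this could be sketched is to wrap the query above in an outer query and filter on the aggregated columns. The table name `user_activity` is a hypothetical stand-in for the real table, and the sketch assumes EntryDate is stored as a DATE or another type that sorts chronologically; strings such as 1/10/16 would not sort correctly as plain text.

```sql
-- Assumption: user_activity is a hypothetical table name with columns
-- UserId, EntryDate (a DATE or otherwise chronologically sortable type)
-- and Activity.
SELECT
    f.UserId
    ,f.MinEntryDate
    ,f.MinActivity
    ,f.MaxEntryDate
    ,f.MaxActivity
FROM
    (
    -- One row per user holding the first and last record side by side
    SELECT
        t.UserId
        ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.EntryDate END) AS MinEntryDate
        ,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.Activity END) AS MinActivity
        ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.EntryDate END) AS MaxEntryDate
        ,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.Activity END) AS MaxActivity
    FROM
        (
        SELECT
            UserId
            ,EntryDate
            ,Activity
            ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate) AS RowNumAsc
            ,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate DESC) AS RowNumDesc
        FROM
            user_activity
        ) t
    WHERE
        t.RowNumAsc = 1
        OR t.RowNumDesc = 1
    GROUP BY
        t.UserId
    ) f
WHERE
    -- Keep only users who started with walk and ended with run
    f.MinActivity = 'walk'
    AND f.MaxActivity = 'run'
```

With the sample data from the question, this should keep a3324 (walk on 1/1/16, run on 1/4/16) and drop a5613, whose activity never changes. To keep every user whose first and last activities differ, whatever their values, the outer filter could instead be `f.MinActivity <> f.MaxActivity`.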
If you don't want to use analytic (window) functions, you can use self left joins and conditional aggregation:
SELECT
t.UserId
,MIN(CASE WHEN mn.UserId IS NULL THEN t.EntryDate END) as MinEntryDate
,MIN(CASE WHEN mn.UserId IS NULL THEN t.Activity END) as MinActivity
,MAX(CASE WHEN mx.UserId IS NULL THEN t.EntryDate END) as MaxEntryDate
,MAX(CASE WHEN mx.UserId IS NULL THEN t.Activity END) as MaxActivity
FROM
Table t
LEFT JOIN Table mn
ON t.UserId = mn.UserId
AND t.EntryDate > mn.EntryDate
LEFT JOIN Table mx
ON t.UserId = mx.UserId
AND t.EntryDate < mx.EntryDate
WHERE
mn.UserId IS NULL
OR mx.UserId IS NULL
GROUP BY
t.UserId
Or the correlated subquery approach:
SELECT
UserId
,MIN(EntryDate) as MinEntryDate
,(SELECT
Activity
FROM
Activity a
WHERE
u.UserId = a.UserId
AND a.EntryDate = MIN(u.EntryDate)
LIMIT 1
) as MinActivity
,MAX(EntryDate) as MaxEntryDate
,(SELECT
Activity
FROM
Activity a
WHERE
u.UserId = a.UserId
AND a.EntryDate = MAX(u.EntryDate)
LIMIT 1
) as MaxActivity
FROM
Activity u
GROUP BY
UserId
Answer 1: (score: 0)
You can use lag/lead to get a solution:
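SELECT * FROM (
select UserID, EntryDate, Activity,
-- Activity of the same user's next record, ordered by EntryDate
lead(Activity, 1) over (partition by UserID order by EntryDate) as nextActivity
from table) as A
-- keep only the rows where the activity changes on the next record
where Activity <> nextActivity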
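The lead approach can also carry the date of the next record and be narrowed to a specific change, such as walk followed by run. The following is only a sketch: `user_activity` is a hypothetical table name, and EntryDate is assumed to be a DATE or another chronologically sortable type.

```sql
-- Assumption: user_activity is a hypothetical table name; EntryDate is
-- assumed to be a DATE (or another type that sorts chronologically).
SELECT
    a.UserId
    ,a.EntryDate
    ,a.Activity
    ,a.NextEntryDate
    ,a.NextActivity
FROM
    (
    SELECT
        UserId
        ,EntryDate
        ,Activity
        -- Date and activity of the same user's next record
        ,LEAD(EntryDate, 1) OVER (PARTITION BY UserId ORDER BY EntryDate) AS NextEntryDate
        ,LEAD(Activity, 1) OVER (PARTITION BY UserId ORDER BY EntryDate) AS NextActivity
    FROM
        user_activity
    ) a
WHERE
    -- Keep only rows where a walk record is immediately followed by a run record
    a.Activity = 'walk'
    AND a.NextActivity = 'run'
```

Note that this compares consecutive records rather than a user's first and last record. With the sample data it should return the 1/3/16 walk row for a3324 paired with the 1/4/16 run row, and a user who switches from walk to run more than once would appear once per switch.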