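I have a table in Hive like the one below.
I want to calculate the time difference in seconds between rows that share the same id, and get the value in the time_diff column.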
Table
+-----+---------+------------------------+-----------+
| id | event | eventdate |time_diff |
+-----+---------+------------------------+-----------+
| 1 | sent | 2017-11-23 03:49:59.0 | 0 |
| 2 | sent | 2017-11-23 04:49:59.0 | 0 |
| 1 | click | 2017-11-24 03:49:50.0 | NULL |
+-----+---------+------------------------+-----------+
expected result
+-----+---------+------------------------+-----------+
| id | event | eventdate |time_diff |
+-----+---------+------------------------+-----------+
| 1 | sent | 2017-11-23 03:49:59.0 | 0 |
| 2 | sent | 2017-11-23 04:49:59.0 | 0 |
| 1 | click | 2017-11-24 03:49:50.0 | 86391 |
+-----+---------+------------------------+-----------+
I worked out the value manually with the following query:
SELECT (unix_timestamp('2017-11-24 03:49:50.0') - unix_timestamp('2017-11-23 03:49:59.0'));
The value I get is 86391, but I can't figure out how to do this across rows where the id is the same.
How can I get the expected result?
Edit
+-----+---------+------------------------+-----------+
| id | event | eventdate |time_diff |
+-----+---------+------------------------+-----------+
| 1 | sent | 2017-11-23 03:49:50.0 | 0 |
| 1 | sent | 2017-11-23 03:49:59.0 | 0 |
| 2 | sent | 2017-11-23 04:49:59.0 | 0 |
| 1 | click | 2017-11-24 03:49:50.0 | NULL |
+-----+---------+------------------------+-----------+
Answer 0 (score: 1)
You can try LAG as a window function; it returns the value from the previous row within each partition.
Schema (MySQL v8.0)
CREATE TABLE T(
    id int,
    event varchar(50),
    eventdate datetime
);
insert into T values (1,'sent', '2017-11-23 03:49:59.0');
insert into T values (2,'sent', '2017-11-23 04:49:59.0');
insert into T values (1,'click', '2017-11-24 03:49:50.0');
Query #1
SELECT *,
       coalesce(unix_timestamp(eventdate)
                - unix_timestamp(LAG(eventdate) OVER (PARTITION BY ID ORDER BY eventdate)), 0) AS time_diff
FROM T;
| id | event | eventdate | time_diff |
| --- | ----- | ------------------- | --------- |
| 1 | sent | 2017-11-23 03:49:59 | 0 |
| 1 | click | 2017-11-24 03:49:50 | 86391 |
| 2 | sent | 2017-11-23 04:49:59 | 0 |
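The query above was run on MySQL 8.0, while the question is about Hive. The following is a minimal sketch of the same LAG idea in HiveQL, not a tested drop-in: the table name events and the alias prev_eventdate are hypothetical, and eventdate is assumed to be a TIMESTAMP column. The LAG call sits in a subquery so that the outer query only does plain arithmetic.

```sql
-- Sketch only: assumes a Hive table named events (hypothetical name)
-- with columns id INT, event STRING, eventdate TIMESTAMP.
SELECT
    id,
    event,
    eventdate,
    -- prev_eventdate is NULL for the first row of each id,
    -- so COALESCE turns that missing difference into 0.
    COALESCE(unix_timestamp(eventdate) - unix_timestamp(prev_eventdate), 0) AS time_diff
FROM (
    SELECT
        id,
        event,
        eventdate,
        -- Previous eventdate within the same id, ordered by time.
        LAG(eventdate) OVER (PARTITION BY id ORDER BY eventdate) AS prev_eventdate
    FROM events
) e;
```

Note that LAG compares each row with the row immediately before it. On the four-row data from the question's edit, the second sent row for id 1 would therefore get 9 rather than 0, and the click row would get 86391.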
Answer 1 (score: 1)
This largely repeats the previous answer, but I think the manual is worth highlighting: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
CREATE TABLE test (id INT, event VARCHAR(8), eventdate timestamp);
INSERT INTO test VALUES (1, 'sent', '2017-11-23 03:49:50.0');
INSERT INTO test VALUES (1, 'sent', '2017-11-23 03:49:59.0');
INSERT INTO test VALUES (2, 'sent', '2017-11-23 04:49:59.0');
INSERT INTO test VALUES (1, 'click', '2017-11-24 03:49:50.0');
SELECT
    id
  , event
  , eventdate
  , CASE WHEN event = 'sent'
         THEN 0
         ELSE unix_timestamp(eventdate)
              - MIN(unix_timestamp(eventdate)) OVER (PARTITION BY id)
    END AS time_diff
FROM test;
+------+-------+---------------------+-----------+
| id | event | eventdate | time_diff |
+------+-------+---------------------+-----------+
| 1 | sent | 2017-11-23 03:49:50 | 0 |
| 1 | sent | 2017-11-23 03:49:59 | 0 |
| 1 | click | 2017-11-24 03:49:50 | 86400 |
| 2 | sent | 2017-11-23 04:49:59 | 0 |
+------+-------+---------------------+-----------+
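I went with MIN() based on experience with this kind of contact/response data, the possibility of multiple click events, and the assumption that time_diff should relate to the initial sent event. The window function can obviously be adjusted as needed.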
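As one possible adjustment, here is a sketch that measures each click against the most recent sent event for the same id instead of the earliest one. It reuses the test table from this answer. The conditional MAX and the explicit ROWS frame are my assumption about what "adjusting the window" might look like, not something the question asked for.

```sql
-- Sketch only: reuses the test table defined earlier in this answer.
SELECT
    id,
    event,
    eventdate,
    CASE WHEN event = 'sent'
         THEN 0
         ELSE unix_timestamp(eventdate)
              -- Latest 'sent' timestamp seen so far for this id;
              -- non-sent rows contribute NULL and are ignored by MAX.
              - MAX(CASE WHEN event = 'sent' THEN unix_timestamp(eventdate) END)
                  OVER (PARTITION BY id
                        ORDER BY eventdate
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    END AS time_diff
FROM test;
```

With the four sample rows, the click for id 1 is compared with the sent event at 2017-11-23 03:49:59, which should give 86391 instead of 86400. A click with no earlier sent event for its id would get NULL, and rows with identical eventdate values have no guaranteed order within the frame.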