Find the difference between timestamps in a column based on another column in Hive

Time: 2018-09-11 21:00:02

Tags: hive hiveql

I have a table in Hive like the one below.

I want to calculate the time difference in seconds between rows that share the same id, and put the result in the time_diff column.

Table

+-----+---------+------------------------+-----------+
| id  |  event  |            eventdate   |time_diff  |
+-----+---------+------------------------+-----------+
| 1   | sent    | 2017-11-23 03:49:59.0  | 0         |
| 2   | sent    | 2017-11-23 04:49:59.0  | 0         |
| 1   | click   | 2017-11-24 03:49:50.0  | NULL      |
+-----+---------+------------------------+-----------+

Expected result

+-----+---------+------------------------+-----------+
| id  |  event  |            eventdate   |time_diff  |
+-----+---------+------------------------+-----------+
| 1   | sent    | 2017-11-23 03:49:59.0  | 0         |
| 2   | sent    | 2017-11-23 04:49:59.0  | 0         |
| 1   | click   | 2017-11-24 03:49:50.0  | 86391     |
+-----+---------+------------------------+-----------+

I have worked out the value manually like this:

SELECT (unix_timestamp('2017-11-24 03:49:50.0') - unix_timestamp('2017-11-23 03:49:59.0'));

The value I get is 86391, but I cannot figure out how to do this for rows where the id is the same.

How can I get the expected result?

  

Edit

+-----+---------+------------------------+-----------+
| id  |  event  |            eventdate   |time_diff  |
+-----+---------+------------------------+-----------+
| 1   | sent    | 2017-11-23 03:49:50.0  | 0         |
| 1   | sent    | 2017-11-23 03:49:59.0  | 0         |
| 2   | sent    | 2017-11-23 04:49:59.0  | 0         |
| 1   | click   | 2017-11-24 03:49:50.0  | NULL      |
+-----+---------+------------------------+-----------+

2 answers:

Answer 0 (score: 1)

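You can try the LAG window function. The demo below runs on MySQL 8.0, but LAG, unix_timestamp, and coalesce are also available in Hive.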

Schema (MySQL v8.0)

CREATE TABLE T(
  id int,
  event varchar(50),
  eventdate datetime
);




insert into T values (1,'sent', '2017-11-23 03:49:59.0');
insert into T values (2,'sent', '2017-11-23 04:49:59.0');
insert into T values (1,'click', '2017-11-24 03:49:50.0');

Query #1

SELECT *, 
   coalesce(unix_timestamp(eventdate) - unix_timestamp(LAG(eventdate) OVER(PARTITION BY ID ORDER BY eventdate)),0) time_diff
FROM T;

| id  | event | eventdate           | time_diff |
| --- | ----- | ------------------- | --------- |
| 1   | sent  | 2017-11-23 03:49:59 | 0         |
| 1   | click | 2017-11-24 03:49:50 | 86391     |
| 2   | sent  | 2017-11-23 04:49:59 | 0         |

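
Since the schema above uses MySQL's datetime type, here is a rough HiveQL sketch of the same LAG approach. The `events` CTE is a hypothetical inline copy of the question's three-row sample, not a real table, and the query assumes a Hive version that supports CTEs and SELECT without FROM (0.13 or later). Treat it as an illustration rather than a tested solution.

```sql
-- Hypothetical inline sample data; replace the CTE with your own table.
WITH events AS (
  SELECT 1 AS id, 'sent' AS event, CAST('2017-11-23 03:49:59' AS TIMESTAMP) AS eventdate
  UNION ALL
  SELECT 2, 'sent', CAST('2017-11-23 04:49:59' AS TIMESTAMP)
  UNION ALL
  SELECT 1, 'click', CAST('2017-11-24 03:49:50' AS TIMESTAMP)
)
SELECT
  id,
  event,
  eventdate,
  -- LAG returns NULL for the first row of each id, so COALESCE maps that to 0.
  COALESCE(
    unix_timestamp(eventdate)
      - unix_timestamp(LAG(eventdate) OVER (PARTITION BY id ORDER BY eventdate)),
    0
  ) AS time_diff
FROM events;
```

On this sample the click row for id 1 should come out as 86391, matching the manual calculation in the question. Note that LAG always subtracts the immediately preceding row for the same id, so on the four-row data from the question's edit the second sent row would get 9 rather than 0.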

Answer 1 (score: 1)

This largely repeats the previous answer, but I think it is worth pointing to the manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

CREATE TABLE test (id INT, event VARCHAR(8), eventdate timestamp);
INSERT INTO test VALUES (1, 'sent', '2017-11-23 03:49:50.0');
INSERT INTO test VALUES (1, 'sent', '2017-11-23 03:49:59.0');
INSERT INTO test VALUES (2, 'sent', '2017-11-23 04:49:59.0');
INSERT INTO test VALUES (1, 'click', '2017-11-24 03:49:50.0');

SELECT
    id
,   event
,   eventdate
,   CASE WHEN event = 'sent'
    THEN 0
    ELSE
        unix_timestamp(eventdate) - MIN(unix_timestamp(eventdate))
            OVER (PARTITION BY id)
    END AS time_diff
FROM test;

+------+-------+---------------------+-----------+
| id   | event | eventdate           | time_diff |
+------+-------+---------------------+-----------+
|    1 | sent  | 2017-11-23 03:49:50 |         0 |
|    1 | sent  | 2017-11-23 03:49:59 |         0 |
|    1 | click | 2017-11-24 03:49:50 |     86400 |
|    2 | sent  | 2017-11-23 04:49:59 |         0 |
+------+-------+---------------------+-----------+

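I chose this definition of time_diff based on experience with this kind of contact/response data, the possibility of multiple click events, and the assumption that MIN() should correspond to the initial sent event. That is why the click row shows 86400: it is measured from the earliest event for id 1 (03:49:50), not from the most recent one. Obviously, the window function can be adjusted as needed.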
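
As one possible adjustment, and purely as an assumption about what the asker may want, the difference could be measured from the most recent sent event instead of the earliest one. The sketch below uses a hypothetical inline `events` CTE holding the four-row sample and a running MAX over sent rows only; it assumes a Hive version with CTE and windowing support.

```sql
-- Hypothetical inline copy of the edited sample data; swap in your own table.
WITH events AS (
  SELECT 1 AS id, 'sent' AS event, CAST('2017-11-23 03:49:50' AS TIMESTAMP) AS eventdate
  UNION ALL
  SELECT 1, 'sent', CAST('2017-11-23 03:49:59' AS TIMESTAMP)
  UNION ALL
  SELECT 2, 'sent', CAST('2017-11-23 04:49:59' AS TIMESTAMP)
  UNION ALL
  SELECT 1, 'click', CAST('2017-11-24 03:49:50' AS TIMESTAMP)
)
SELECT
  id,
  event,
  eventdate,
  CASE WHEN event = 'sent'
    THEN 0
    ELSE unix_timestamp(eventdate)
      -- Latest 'sent' timestamp seen so far for this id; non-sent rows are ignored.
      - MAX(CASE WHEN event = 'sent' THEN unix_timestamp(eventdate) END)
          OVER (PARTITION BY id
                ORDER BY eventdate
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
  END AS time_diff
FROM events;
```

With this variant the click row for id 1 should be measured from 03:49:59 and give 86391, the value the question expects. A click with no earlier sent event for the same id would get NULL.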