Incrementing a timestamp in Hive for records with the same combination

Date: 2017-09-01 08:55:04

Tags: hadoop hive

I have a dataset in a Hive table:

input1,input2,input_time
key1,val1,2017-02-03 00:00:00
key1,val1,2017-02-03 00:00:00
key1,val2,2017-02-03 00:00:00
key1,val2,2017-02-03 00:00:00
key2,val1,2017-02-03 00:00:00

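The columns (input1, input2) together identify a combination. Within each combination, I want to increment the input_time column by one second per record, e.g. "2017-02-03 00:00:01" for the first record.

Say one combination has 65 records: once the seconds go past 59, the value should carry over into the minutes, e.g. "2017-02-03 00:01:01".

How can we increment the time for records with the same combination? Is this possible in Hive?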

Expected output:
input1,input2,input_time
key1,val1,2017-02-03 00:00:01
key1,val1,2017-02-03 00:00:02
key1,val2,2017-02-03 00:00:01
key1,val2,2017-02-03 00:00:02
key2,val1,2017-02-03 00:00:01

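1 Answer:

Answer 0 (score: 0)

You can use a window function to generate a temporary index for each row, which you then add to the timestamp. In the queries below, k, v and ts stand for input1, input2 and input_time, and the table is called ts_test.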

select 
   k, v , unix_timestamp(ts) as ts, 
   row_number() over ( partition by k,v ) as rn  
from ts_test

This produces the following (the epoch values depend on the session time zone):

+----+----+----------+---+
|   k|   v|        ts| rn|
+----+----+----------+---+
|key1|val1|1486101600|  1|
|key1|val1|1486101600|  2|
|key1|val2|1486101600|  1|
|key1|val2|1486101600|  2|
|key2|val1|1486101600|  1|
+----+----+----------+---+

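Now you can add rn to the epoch value and convert the sum back to a time string. The input is already in yyyy-MM-dd HH:mm:ss format, so unix_timestamp parses it without an explicit pattern.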

SELECT a.k, a.v, from_unixtime(ts+rn) as newts from 
   ( select k, v , unix_timestamp(ts) as ts, row_number() over ( partition by k,v ) as rn  
from ts_test )a 

+----+----+-------------------+
|   k|   v|              newts|
+----+----+-------------------+
|key1|val1|2017-02-03 00:00:01|
|key1|val1|2017-02-03 00:00:02|
|key1|val2|2017-02-03 00:00:01|
|key1|val2|2017-02-03 00:00:02|
|key2|val1|2017-02-03 00:00:01|
+----+----+-------------------+
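
The 65-record case from the question should need no special handling: the offset is added to epoch seconds before formatting, so the minute carries over on its own. A quick check with a literal instead of the table (this assumes Hive 0.13 or later, where SELECT without a FROM clause is allowed):

```sql
-- Row 61 of a group: 61 seconds past midnight carries into the minute field.
SELECT from_unixtime(unix_timestamp('2017-02-03 00:00:00') + 61) AS newts;
-- 2017-02-03 00:01:01
```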

As @DuduMarkovitz pointed out, this can also be done in a single select:

select 
   k, v , 
   from_unixtime(unix_timestamp(ts) + row_number() over ( partition by k,v order by v asc ) ) as newts
from ts_test
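
Mapped onto the column names from the question, the same idea might look like the sketch below. The table name input_table and the alias new_input_time are placeholders, since the question does not name the table, and the query assumes input_time is stored as a string in yyyy-MM-dd HH:mm:ss format.

```sql
-- input_table is a hypothetical name; substitute the real table.
SELECT
   input1,
   input2,
   -- 1-based index within each (input1, input2) group, added as seconds
   from_unixtime(
      unix_timestamp(input_time)
      + row_number() OVER (PARTITION BY input1, input2 ORDER BY input_time)
   ) AS new_input_time
FROM input_table;
```

Rows within a group that share the same input_time are interchangeable, so it does not matter which of them receives which offset.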