Grouping timestamped records by a consecutively repeated flag

Asked: 2019-06-21 09:18:24

Tags: sql hive

I have a dataset with the following columns:

DriverId  DateStamp        IsDriving  WasDriving DistanceSincePrev SecondsSincePrev
1         11/10/2018 08:00 0          0          0                 12
1         11/10/2018 08:01 1          0          10                60
1         11/10/2018 08:01 1          1          100               54
1         11/10/2018 08:02 1          1          14                32
1         11/10/2018 08:03 1          1          33                60
1         11/10/2018 08:04 0          1          10                59
1         11/10/2018 08:04 0          0          0                 60
1         11/10/2018 08:05 1          0          0                 60
1         11/10/2018 08:06 1          1          500               43
1         11/10/2018 08:06 0          1          300               32
1         11/10/2018 08:07 0          0          0                 60
1         11/10/2018 08:08 0          0          0                 12
1         11/10/2018 08:09 0          0          10                60
1         11/10/2018 08:10 0          0          100               54
1         11/10/2018 08:11 0          0          14                32
1         11/10/2018 08:12 0          0          33                60
1         11/10/2018 08:13 0          0          10                59
1         11/10/2018 08:14 0          0          0                 60
1         11/10/2018 08:15 1          0          0                 60
1         11/10/2018 08:16 1          1          500               43
1         11/10/2018 08:16 1          1          300               32
1         11/10/2018 08:17 1          1          0                 60
1         11/10/2018 08:18 1          1          500               43
1         11/10/2018 08:19 1          1          300               32
1         11/10/2018 08:19 1          1          0                 60
1         11/10/2018 08:20 1          1          500               43
1         11/10/2018 08:21 1          1          300               32
1         11/10/2018 08:22 1          1          0                 60
1         11/10/2018 08:23 1          1          500               43
1         11/10/2018 08:24 1          1          300               32
1         11/10/2018 08:24 0          1          0                 60
1         11/10/2018 08:25 0          0          0                 60

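As you can see, these are timestamped records of one person driving. I want to group these timestamps into RIDES, by which I mean the stretches during which the person drove without turning off the engine. In this dataset I can use the "IsDriving" and "WasDriving" columns to do that, but I am having trouble writing the query.

I have two ideas for how the algorithm could work:

1) More ideal, probably harder: the query detects a record where IsDriving is 1 and WasDriving is 0 and counts it as the start of a ride. It then detects a record where IsDriving is 0 and WasDriving is 1 and ends the ride there.

2) A bit heuristic, but good enough: the query simply aggregates consecutive records where IsDriving and WasDriving are both 1 and counts them as one ride.

Unfortunately, I have not been able to translate either algorithm into SQL.

Ideally, my output would look like this: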
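For reference, idea 1 can be written procedurally. The following is a minimal Python sketch of the logic, not SQL. The names `Rec` and `rides_by_transitions` are hypothetical, the sample is the first eleven rows of the table above, and the sketch assumes the closing record (IsDriving 0, WasDriving 1) still counts toward the ride's distance and duration.

```python
from typing import NamedTuple


class Rec(NamedTuple):
    stamp: str
    is_driving: int
    was_driving: int
    distance: float
    seconds: float


def rides_by_transitions(records):
    """Return (start_stamp, total_distance, total_seconds) for each closed ride.

    Records must already be in timestamp order. A ride that is still open
    when the data ends is not returned.
    """
    rides = []
    current = None  # [start, distance, seconds] while a ride is open
    for r in records:
        if r.is_driving == 1 and r.was_driving == 0:
            current = [r.stamp, 0.0, 0.0]  # engine just turned on
        if current is not None:
            current[1] += r.distance
            current[2] += r.seconds
            if r.is_driving == 0 and r.was_driving == 1:
                rides.append(tuple(current))  # engine just turned off
                current = None
    return rides


sample = [
    Rec("11/10/2018 08:00", 0, 0, 0, 12),
    Rec("11/10/2018 08:01", 1, 0, 10, 60),
    Rec("11/10/2018 08:01", 1, 1, 100, 54),
    Rec("11/10/2018 08:02", 1, 1, 14, 32),
    Rec("11/10/2018 08:03", 1, 1, 33, 60),
    Rec("11/10/2018 08:04", 0, 1, 10, 59),
    Rec("11/10/2018 08:04", 0, 0, 0, 60),
    Rec("11/10/2018 08:05", 1, 0, 0, 60),
    Rec("11/10/2018 08:06", 1, 1, 500, 43),
    Rec("11/10/2018 08:06", 0, 1, 300, 32),
    Rec("11/10/2018 08:07", 0, 0, 0, 60),
]

for ride in rides_by_transitions(sample):
    print(ride)
```

The difficulty in SQL is that this loop carries state from row to row, which a set-based query has to express through window functions instead.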

DriverId StartOfRide       DistanceOfRide  LengthOfRide
1        11/10/2018 08:00  1400            221
1        11/10/2018 08:30  5900            329
1        11/10/2018 12:00  21400           3600

2 answers:

Answer 0: (score: 1)

Maybe this will do it; remove or add columns as needed:

create table #tmp (DriverId int, DateStamp datetime, IsDriving int, WasDriving int, DistanceSincePrev float, SecondsSincePrev float)

insert into #tmp values
(1, '11/10/2018 08:00', 0, 0, 0,   12),
(1, '11/10/2018 08:01', 1, 0, 10,  60),
(1, '11/10/2018 08:01', 1, 1, 100, 54),
(1, '11/10/2018 08:02', 1, 1, 14,  32),
(1, '11/10/2018 08:03', 1, 1, 33,  60),
(1, '11/10/2018 08:04', 0, 1, 10,  59),
(1, '11/10/2018 08:04', 0, 0, 0,   60),
(1, '11/10/2018 08:05', 1, 0, 0,   60),
(1, '11/10/2018 08:06', 1, 1, 500, 43),
(1, '11/10/2018 08:06', 0, 1, 300, 32),
(1, '11/10/2018 08:07', 0, 0, 0,   60),
(1, '11/10/2018 08:08', 0, 0, 0,   12),
(1, '11/10/2018 08:09', 0, 0, 10,  60),
(1, '11/10/2018 08:10', 0, 0, 100, 54),
(1, '11/10/2018 08:11', 0, 0, 14,  32),
(1, '11/10/2018 08:12', 0, 0, 33,  60),
(1, '11/10/2018 08:13', 0, 0, 10,  59)

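-- Gaps-and-islands: within each driver, the difference between the two
-- row numbers is constant across a run of consecutive rows that share the
-- same IsDriving value. The difference alone can repeat across different
-- IsDriving values, so IsDriving is part of every partition below.
-- Rows with identical DateStamp values have no defined order here.
select * from
(
    select DateStamp as RideStart, DriverId, IsDriving, Grp,
           SUM(DistanceSincePrev) over (partition by DriverId, IsDriving, Grp) as DistanceOfRide,
           SUM(SecondsSincePrev) over (partition by DriverId, IsDriving, Grp) as LengthOfRide,
           ROW_NUMBER() over (partition by DriverId, IsDriving, Grp order by DateStamp) as r
    from
    (
        select *,
               Grp = ROW_NUMBER() over (partition by DriverId order by DateStamp) -
                     ROW_NUMBER() over (partition by DriverId, IsDriving order by DateStamp)
        from #tmp
    ) s
) x
where r = 1  -- first row of each run; add "and IsDriving = 1" to keep only driving runs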
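
To see why the row-number difference identifies runs, here is a small Python sketch of the same arithmetic. It illustrates the idea only and is not part of the query; `island_keys` is a hypothetical helper, and the flag list is the IsDriving column of the first ten sample rows.

```python
from collections import defaultdict


def island_keys(flags):
    """Label each flag with (flag, overall row number - row number within that flag).

    Mirrors the difference of the two ROW_NUMBER() calls for a single driver
    whose rows are already in timestamp order.
    """
    seen = defaultdict(int)  # rows seen so far for each flag value
    keys = []
    for overall_rn, flag in enumerate(flags, start=1):
        seen[flag] += 1
        keys.append((flag, overall_rn - seen[flag]))
    return keys


flags = [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
for flag, key in zip(flags, island_keys(flags)):
    print(flag, key)
```

Consecutive rows with the same flag get the same key, and the key changes whenever the flag flips. The flag has to stay in the key: for the flags 0, 1, 0 the differences are 0, 1, 1, so the difference alone would merge the second and third rows.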

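Answer 1: (score: 1)

You need to assign groups and then aggregate. In this case, you can define a group as the number of records with IsDriving = 0 up to and including each record. Then aggregate: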

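select driverid, min(datestamp) as startofride,
       sum(distancesinceprev) as distanceofride,
       sum(secondssinceprev) as lengthofride
from (select t.*,
             -- running count of non-driving rows: constant across each run of driving rows
             sum(1 - isdriving) over (partition by driverid order by datestamp) as grp
      from t
     ) t
where isdriving = 1  -- drop the non-driving rows so each group holds exactly one ride
group by driverid, grp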
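
The running count can be checked by hand with a short Python sketch of the same logic. This is an illustration under stated assumptions, not the query itself: `rides_by_zero_count` is a hypothetical name, the rows are the first eleven sample records for one driver in timestamp order, and non-driving rows are filtered out before aggregating.

```python
from itertools import accumulate


def rides_by_zero_count(rows):
    """rows: (stamp, is_driving, distance, seconds) tuples in timestamp order.

    Returns (start_stamp, total_distance, total_seconds) per run of driving rows.
    """
    # Running count of non-driving rows; it stays constant across a run of 1s.
    grp = list(accumulate(1 - is_driving for _, is_driving, _, _ in rows))
    rides = {}
    for g, (stamp, is_driving, distance, seconds) in zip(grp, rows):
        if is_driving != 1:
            continue  # keep only the driving rows of each group
        if g not in rides:
            rides[g] = [stamp, 0, 0]  # first driving row marks the ride start
        rides[g][1] += distance
        rides[g][2] += seconds
    return [tuple(r) for r in rides.values()]


rows = [
    ("11/10/2018 08:00", 0, 0, 12),
    ("11/10/2018 08:01", 1, 10, 60),
    ("11/10/2018 08:01", 1, 100, 54),
    ("11/10/2018 08:02", 1, 14, 32),
    ("11/10/2018 08:03", 1, 33, 60),
    ("11/10/2018 08:04", 0, 10, 59),
    ("11/10/2018 08:04", 0, 0, 60),
    ("11/10/2018 08:05", 1, 0, 60),
    ("11/10/2018 08:06", 1, 500, 43),
    ("11/10/2018 08:06", 0, 300, 32),
    ("11/10/2018 08:07", 0, 0, 60),
]

for ride in rides_by_zero_count(rows):
    print(ride)
```

Unlike a transition-based scan, this grouping leaves out the record on which the engine turns off, because that record has IsDriving = 0.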