Spark DataFrame - scala - How to group table rows by timestamp in 5-minute intervals

Time: 2016-09-08 03:26:02

Tags: scala apache-spark

I have a Cassandra table like this:

CREATE TABLE vehicle_speed (
   device_id UUID,
   intersection text,   
   date text,
   time timestamp,
   vehicle_type text,
   speed int,
   batch_id text,
   batch_status text,   
   PRIMARY KEY ((intersection, vehicle_type), date, time)
);

The table contains a few records, as shown below:

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'ABC', '2016-08-28', '2016-08-28 00:05:00+0200', 'CAR', 45, 'PROCESSED');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'ABC', '2016-08-28', '2016-08-28 00:07:02+0200', 'CAR', 46, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'XYZ', '2016-08-28', '2016-08-28 00:07:00+0200', 'CAR', 52, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'ABC', '2016-08-28', '2016-08-28 00:08:30+0200', 'CAR', 44, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'ABC', '2016-08-28', '2016-08-28 00:14:30+0200', 'CAR', 40, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'XYZ', '2016-08-28', '2016-08-28 00:09:00+0200', 'CAR', 45, 'NEW');


INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'ABC', '2016-08-28', '2016-08-28 00:08:00+0200', 'BUS', 43, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'ABC', '2016-08-28', '2016-08-28 00:07:00+0200', 'BUS', 40, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'XYZ', '2016-08-28', '2016-08-28 00:08:00+0200', 'BUS', 41, 'NEW');

INSERT INTO vehicle_speed (device_id, intersection, date, time, vehicle_type, speed, batch_status) VALUES(f81d4fae-7dec-11d0-a765-00a0c91e6bf7, 'XYZ', '2016-08-28', '2016-08-28 00:08:00+0200', 'BUS', 45, 'NEW');

I have just started learning Scala and Spark. Could you help me write Scala and Spark code for the following use cases?

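  1. Get all rows that have not been processed yet (i.e. batch_status = 'NEW').
  2. Get all rows for a specific intersection (i.e. ABC / XYZ).
  3. Get all rows for a specific vehicle type (i.e. CAR / BUS).
  4. Get all rows that fall within a 5-minute duration (i.e. a row is included if the difference between the minimum timestamp and that row's timestamp is less than 5 minutes).
  5. Calculate the average speed of all vehicles that pass the filters above (i.e. steps 1 to 4).
  6. Create a new table and insert records containing the data aggregated over each 5-minute interval.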
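
Steps 1 to 5 can be sketched in plain Scala, without Spark or Cassandra, to make the intended windowing arithmetic concrete. This is only a minimal illustration under stated assumptions: `Reading`, `FiveMinuteAverages`, and `averages` are hypothetical names that do not come from the question, the rows are an in-memory copy of the ABC records from the INSERT statements, and each 5-minute window is measured from the minimum timestamp of the filtered rows, which is one reading of step 4. Step 6 (writing the result to a new table) is not covered.

```scala
import java.time.{Duration, Instant, OffsetDateTime}

// Hypothetical in-memory stand-in for a vehicle_speed row (only the columns used here).
final case class Reading(
    intersection: String,
    vehicleType: String,
    time: OffsetDateTime,
    speed: Int,
    batchStatus: String)

object FiveMinuteAverages {
  private def ts(s: String): OffsetDateTime = OffsetDateTime.parse(s)

  // The ABC rows from the INSERT statements, rewritten as ISO-8601 timestamps.
  val sample: Seq[Reading] = Seq(
    Reading("ABC", "CAR", ts("2016-08-28T00:05:00+02:00"), 45, "PROCESSED"),
    Reading("ABC", "CAR", ts("2016-08-28T00:07:02+02:00"), 46, "NEW"),
    Reading("ABC", "CAR", ts("2016-08-28T00:08:30+02:00"), 44, "NEW"),
    Reading("ABC", "CAR", ts("2016-08-28T00:14:30+02:00"), 40, "NEW"),
    Reading("ABC", "BUS", ts("2016-08-28T00:08:00+02:00"), 43, "NEW"),
    Reading("ABC", "BUS", ts("2016-08-28T00:07:00+02:00"), 40, "NEW")
  )

  // Returns (window start, average speed) pairs ordered by window start.
  // Assumption: windows are counted from the earliest timestamp among the filtered rows.
  def averages(
      rows: Seq[Reading],
      intersection: String,
      vehicleType: String,
      width: Duration = Duration.ofMinutes(5)): Seq[(Instant, Double)] = {
    // Steps 1 to 3: keep unprocessed rows for one intersection and one vehicle type.
    val selected = rows.filter { r =>
      r.batchStatus == "NEW" &&
      r.intersection == intersection &&
      r.vehicleType == vehicleType
    }
    if (selected.isEmpty) Seq.empty
    else {
      val widthSeconds = width.getSeconds
      val start = selected.map(_.time.toEpochSecond).min
      selected
        // Step 4: window index = whole number of widths elapsed since the earliest row.
        .groupBy(r => (r.time.toEpochSecond - start) / widthSeconds)
        .toSeq
        .sortBy(_._1)
        // Step 5: average speed per window.
        .map { case (index, group) =>
          val windowStart = Instant.ofEpochSecond(start + index * widthSeconds)
          (windowStart, group.map(_.speed.toDouble).sum / group.size)
        }
    }
  }
}
```

For example, `FiveMinuteAverages.averages(FiveMinuteAverages.sample, "ABC", "CAR")` groups the 00:07:02 and 00:08:30 readings into the first window and the 00:14:30 reading into the second. In Spark the same shape would presumably become a DataFrame `filter` followed by `groupBy` and `avg`; Spark 2.0 and later also provide a `window` function in `org.apache.spark.sql.functions`, but it produces windows aligned to fixed clock boundaries rather than to the minimum timestamp, so it matches step 4 only if that alignment is acceptable.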

0 answers:

No answers yet