I'm trying to extract combined data intervals from a time series in Scala and Spark.
I have the following data in a DataFrame:
Id | State | StartTime | EndTime
---+-------+---------------------+--------------------
1 | R | 2019-01-01T03:00:00 | 2019-01-01T11:30:00
1 | R | 2019-01-01T11:30:00 | 2019-01-01T15:00:00
1 | R | 2019-01-01T15:00:00 | 2019-01-01T22:00:00
1 | W | 2019-01-01T22:00:00 | 2019-01-02T04:30:00
1 | W | 2019-01-02T04:30:00 | 2019-01-02T13:45:00
1 | R | 2019-01-02T13:45:00 | 2019-01-02T18:30:00
1 | R | 2019-01-02T18:30:00 | 2019-01-02T22:45:00
I need to collapse the data into intervals by Id and State. The resulting data needs to look like:
Id | State | StartTime | EndTime
---+-------+---------------------+--------------------
1 | R | 2019-01-01T03:00:00 | 2019-01-01T22:00:00
1 | W | 2019-01-01T22:00:00 | 2019-01-02T13:45:00
1 | R | 2019-01-02T13:45:00 | 2019-01-02T22:45:00
Note that the first three records are grouped together because the device was continuously in state R from 2019-01-01T03:00:00 to 2019-01-01T22:00:00, then switched to state W for the next two records from 2019-01-01T22:00:00 to 2019-01-02T13:45:00, and then returned to state R for the last two records.
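For a reproducible example, the input above can be built roughly like this (a minimal sketch, assuming an active SparkSession named spark and string-typed timestamp columns):

import spark.implicits._

val df = Seq(
  (1, "R", "2019-01-01T03:00:00", "2019-01-01T11:30:00"),
  (1, "R", "2019-01-01T11:30:00", "2019-01-01T15:00:00"),
  (1, "R", "2019-01-01T15:00:00", "2019-01-01T22:00:00"),
  (1, "W", "2019-01-01T22:00:00", "2019-01-02T04:30:00"),
  (1, "W", "2019-01-02T04:30:00", "2019-01-02T13:45:00"),
  (1, "R", "2019-01-02T13:45:00", "2019-01-02T18:30:00"),
  (1, "R", "2019-01-02T18:30:00", "2019-01-02T22:45:00")
).toDF("Id", "State", "StartTime", "EndTime")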
Answer 0 (score: 0)
It turns out the answer is Combine rows when the end time of one is the start time of another (Oracle), translated to Spark.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, max, row_number}
import spark.implicits._

// Row numbers per Id and per (Id, State): their difference is constant
// within each run of consecutive rows that share the same state.
val idSpec = Window.partitionBy('Id).orderBy('StartTime)
val idStateSpec = Window.partitionBy('Id, 'State).orderBy('StartTime)

val df2 = df
  .select('Id, 'State, 'StartTime, 'EndTime,
    row_number().over(idSpec).as("idRowNumber"),
    row_number().over(idStateSpec).as("idStateRowNumber"))
  .groupBy('Id, 'State, 'idRowNumber - 'idStateRowNumber)
  .agg(min('StartTime).as("StartTime"), max('EndTime).as("EndTime"))
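Since the technique comes from an Oracle SQL answer, the same gaps-and-islands grouping can also be expressed through Spark SQL. A sketch, assuming the question's df is registered as a temporary view (the view name intervals is illustrative):

// Register the input so it can be queried with SQL
df.createOrReplaceTempView("intervals")

val merged = spark.sql("""
  SELECT Id, State, MIN(StartTime) AS StartTime, MAX(EndTime) AS EndTime
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Id ORDER BY StartTime)
         - ROW_NUMBER() OVER (PARTITION BY Id, State ORDER BY StartTime) AS grp
    FROM intervals
  ) t
  GROUP BY Id, State, grp
  ORDER BY StartTime
""")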
Answer 1 (score: 0)
Since I recently had a similar case, I would like to provide a complete solution for it. This part of the code:
val df2 = df
  .select('Id, 'State, 'StartTime, 'EndTime,
    row_number().over(idSpec).as("idRowNumber"),
    row_number().over(idStateSpec).as("idStateRowNumber"))
produces the output:
+---+-----+-------------------+-------------------+-----------+----------------+
| Id|State| StartTime| EndTime|idRowNumber|idStateRowNumber|
+---+-----+-------------------+-------------------+-----------+----------------+
| 1| R|2019-01-01 03:00:00|2019-01-01 11:30:00| 1| 1|
| 1| R|2019-01-01 11:30:00|2019-01-01 15:00:00| 2| 2|
| 1| R|2019-01-01 15:00:00|2019-01-01 22:00:00| 3| 3|
| 1| W|2019-01-01 22:00:00|2019-01-02 04:30:00| 4| 1|
| 1| W|2019-01-02 04:30:00|2019-01-02 13:45:00| 5| 2|
| 1| R|2019-01-02 13:45:00|2019-01-02 18:30:00| 6| 4|
| 1| R|2019-01-02 18:30:00|2019-01-02 22:45:00| 7| 5|
+---+-----+-------------------+-------------------+-----------+----------------+
Note that for every combination of (Id, State), the difference between idRowNumber and idStateRowNumber is constant, so we can put this difference in a new column called Category and group by it to get the minimum StartTime and maximum EndTime for each group. The complete code should look like the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, max, row_number}
import spark.implicits._

val idSpec = Window.partitionBy('Id).orderBy('StartTime)
val idStateSpec = Window.partitionBy('Id, 'State).orderBy('StartTime)

val df2 = df
  .select('Id, 'State, 'StartTime.cast("timestamp"), 'EndTime.cast("timestamp"),
    row_number().over(idSpec).as("idRowNumber"),
    row_number().over(idStateSpec).as("idStateRowNumber"))
  // The difference is constant within each run of rows in the same state
  .withColumn("Category", $"idRowNumber" - $"idStateRowNumber")
  .groupBy("Category", "Id", "State")
  .agg(min("StartTime").as("StartTime"), max("EndTime").as("EndTime"))
  .drop("Category")
Output:
+---+-----+-------------------+-------------------+
| Id|State| StartTime| EndTime|
+---+-----+-------------------+-------------------+
| 1| R|2019-01-01 03:00:00|2019-01-01 22:00:00|
| 1| W|2019-01-01 22:00:00|2019-01-02 13:45:00|
| 1| R|2019-01-02 13:45:00|2019-01-02 22:45:00|
+---+-----+-------------------+-------------------+
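As a side note, the same Category column can be derived without the two row numbers by flagging state changes with lag and taking a running sum. This is a minimal sketch of that alternative (not part of the original answers), under the same schema assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, max, min, sum, when}
import spark.implicits._

val byStart = Window.partitionBy($"Id").orderBy($"StartTime")

val merged = df
  // 1 whenever the state differs from the previous row (lag is null on the first row)
  .withColumn("changed",
    when(lag($"State", 1).over(byStart) === $"State", 0).otherwise(1))
  // Running sum of the flags assigns one id per consecutive run of a state
  .withColumn("Category", sum($"changed").over(byStart))
  .groupBy($"Id", $"State", $"Category")
  .agg(min($"StartTime").as("StartTime"), max($"EndTime").as("EndTime"))
  .drop("Category")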