I have a DataFrame like the following:
+------+---------------------+
| state| time stamp          |
+------+---------------------+
| 0    | Sun Aug 13 10:58:44 |
| 1    | Sun Aug 13 11:59:44 |
| 1    | Sun Aug 13 12:50:43 |
| 1    | Sun Aug 13 13:00:44 |
| 0    | Sun Aug 13 13:58:42 |
| 0    | Sun Aug 13 14:00:41 |
| 0    | Sun Aug 13 14:30:45 |
| 0    | Sun Aug 13 14:58:46 |
| 1    | Sun Aug 13 15:00:47 |
| 0    | Sun Aug 13 16:00:49 |
+------+---------------------+
I only need the timestamps at which the state changes (from 0 to 1, or from 1 back to 0).
I need to separate out these rows:
Sun Aug 13 11:59:44
Sun Aug 13 13:58:42
Sun Aug 13 15:00:47
Sun Aug 13 16:00:49
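Then I want to take the time differences and sum them.
Can someone suggest what kind of query I should write for this?
I need a result like the following: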
(13:58:42 - 11:59:44) + (16:00:49 - 15:00:47)
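Answer 0 (score: 1)
A Window function should help with your first requirement (comparing each row's state with the previous row's). A filter will meet your second requirement (keeping only the rows where the state changed). The third requirement can be met by extracting the time from the datetime values.
Given the DataFrame as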
+-----+-------------------+
|state|timestamp |
+-----+-------------------+
|0 |Sun Aug 13 10:58:44|
|1 |Sun Aug 13 11:59:44|
|1 |Sun Aug 13 12:50:43|
|1 |Sun Aug 13 13:00:44|
|0 |Sun Aug 13 13:58:42|
|0 |Sun Aug 13 14:00:41|
|0 |Sun Aug 13 14:30:45|
|0 |Sun Aug 13 14:58:46|
|1 |Sun Aug 13 15:00:47|
|0 |Sun Aug 13 16:00:49|
+-----+-------------------+
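doing what I explained above should help. The following handles your first and second requirements.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Previous row's state in timestamp order; the first row has no predecessor, so default it to 0
df.withColumn("temp", lag("state", 1).over(Window.orderBy("timestamp")))
  .withColumn("temp", when(col("temp").isNull, lit(0)).otherwise(col("temp")))
  // Keep only the rows whose state differs from the previous row's state
  .filter(col("state") =!= col("temp"))
You should get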
+-----+-------------------+----+
|state|timestamp |temp|
+-----+-------------------+----+
|1 |Sun Aug 13 11:59:44|0 |
|0 |Sun Aug 13 13:58:42|1 |
|1 |Sun Aug 13 15:00:47|0 |
|0 |Sun Aug 13 16:00:49|1 |
+-----+-------------------+----+
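Now for your third requirement, you need to find a way to extract the time from the timestamp column, and then do the following:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df.withColumn("temp", lag("state", 1).over(Window.orderBy("timestamp")))
  .withColumn("temp", when(col("temp").isNull, lit(0)).otherwise(col("temp")))
  .filter(col("state") =!= col("temp"))
  // Gather the change timestamps into a single array column
  .select(collect_list(col("timestamp")).as("time"))
  // Build the "end - start + end - start" expression as a string (hardcoded for four change rows)
  .withColumn("time", concat_ws(" + ", concat_ws(" - ", col("time")(1), col("time")(0)), concat_ws(" - ", col("time")(3), col("time")(2))))
You should get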
+-------------------------------------------------------------------------------------+
|time |
+-------------------------------------------------------------------------------------+
|Sun Aug 13 13:58:42 - Sun Aug 13 11:59:44 + Sun Aug 13 16:00:49 - Sun Aug 13 15:00:47|
+-------------------------------------------------------------------------------------+
This matches the expression you asked for, except that the time has not been extracted from the timestamp values.
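To turn that expression into an actual number, one possible continuation is sketched below. It is not part of the answer above and rests on assumptions that should be checked against real data: the timestamp strings always end in HH:mm:ss starting at character 12, all rows fall on the same day (so ordering by the string and subtracting seconds since midnight are both valid), and the names `StateDurations`, `sampleFrame`, and `totalActiveSeconds` are hypothetical. For the sample rows the differences are 7138 s and 3602 s, so the sum should be 10740 seconds (02:59:00).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object StateDurations {

  // Sample rows copied from the question.
  def sampleFrame(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (0, "Sun Aug 13 10:58:44"), (1, "Sun Aug 13 11:59:44"),
      (1, "Sun Aug 13 12:50:43"), (1, "Sun Aug 13 13:00:44"),
      (0, "Sun Aug 13 13:58:42"), (0, "Sun Aug 13 14:00:41"),
      (0, "Sun Aug 13 14:30:45"), (0, "Sun Aug 13 14:58:46"),
      (1, "Sun Aug 13 15:00:47"), (0, "Sun Aug 13 16:00:49")
    ).toDF("state", "timestamp")
  }

  // Sums (end - start) in seconds over every run of state 1.
  def totalActiveSeconds(df: DataFrame): Long = {
    // Assumption: the time of day is the 8 characters starting at position 12.
    val parts = split(substring(col("timestamp"), 12, 8), ":")
    val secs = parts(0).cast("int") * 3600 +
      parts(1).cast("int") * 60 +
      parts(2).cast("int")

    // Assumption: string order equals time order (all rows share one date prefix).
    val w = Window.orderBy("timestamp")

    df.withColumn("secs", secs)
      // Previous state, defaulting to 0 for the first row (same idea as the answer).
      .withColumn("prev", coalesce(lag("state", 1).over(w), lit(0)))
      .filter(col("state") =!= col("prev"))
      // Among the change rows, the row before a 1 -> 0 change is the matching 0 -> 1 change.
      .withColumn("start", lag("secs", 1).over(w))
      .filter(col("state") === 0)
      .agg(coalesce(sum(col("secs") - col("start")), lit(0L)).as("total_seconds"))
      .first()
      .getLong(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("state-durations")
      .getOrCreate()
    // Expected 10740 for the sample data (7138 + 3602).
    println(totalActiveSeconds(sampleFrame(spark)))
    spark.stop()
  }
}
```

Unlike the `concat_ws` version, this does not hardcode four change rows: any trailing run of 1s that never returns to 0 is simply left out of the sum. If the data spans several days, the timestamps would need to be parsed into real timestamp values (including a year) before ordering and subtracting.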