Basically, I need to get the cumulative minutes for a list of timestamps:
Timestamp, cum
2017-06-04 02:58:00, 0
2017-06-04 03:02:00, 4
2017-06-04 03:05:00, 7
2017-06-04 03:10:00, 12
Here is what I have been trying:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Running sum of the raw event timestamps per user
windowSpec = W.partitionBy(A["userid"]).orderBy(A["eventtime"])
acumEventTime = F.sum(col("eventtime")).over(windowSpec)
A.select("userid", "eventtime", acumEventTime.alias("acumEventTime"))
Summing the timestamps over a window gives me the following values in the acumEventTime field:
acumEventTime
2.9930904E9,
1.4965452E9,
1.4965452E9,
1.4965452E9,
2.9930904E9
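As far as I can tell, these are sums of the raw Unix-epoch seconds (each timestamp here casts to roughly 1.4965452E9 seconds), not minutes. A quick way to check that hunch, assuming A is the DataFrame above:

# Sanity check: each eventtime cast to long is its Unix-epoch second count,
# which matches the ~1.4965452E9 values shown above.
A.select(col("eventtime").cast("long")).show()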
Is there an efficient way to get just the minutes?
Answer 0 (score: 1)
Given the description, I would rather combine lag and sum:
from pyspark.sql.functions import col, coalesce, lag, lit, sum
from pyspark.sql.window import Window

df = (spark.createDataFrame([
        (1, "2017-06-04 02:58:00"),
        (1, "2017-06-04 03:02:00"),
        (1, "2017-06-04 03:05:00"),
        (1, "2017-06-04 03:10:00"),
    ])
    .toDF("userid", "eventtime")
    .withColumn("eventtime", col("eventtime").cast("timestamp")))

w = Window.partitionBy("userid").orderBy("eventtime")

# Gap to the previous event in seconds (0 for the first row in each
# partition), accumulated over the window and converted to whole minutes.
cum = (sum(coalesce(
    col("eventtime").cast("long") - lag("eventtime", 1).over(w).cast("long"),
    lit(0)
)).over(w) / 60).cast("long")

df.withColumn("cum", cum).show()
+------+-------------------+---+
|userid| eventtime|cum|
+------+-------------------+---+
| 1|2017-06-04 02:58:00| 0|
| 1|2017-06-04 03:02:00| 4|
| 1|2017-06-04 03:05:00| 7|
| 1|2017-06-04 03:10:00| 12|
+------+-------------------+---+
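Since the running sum of per-row gaps telescopes to the offset from the partition's first event, the same result can be computed without lag. A minimal sketch of that equivalent variant (my own suggestion, not part of the original answer), reusing w and df from above:

from pyspark.sql.functions import first

# Minutes elapsed since each user's first event: over the ordered
# window, first("eventtime") is the earliest timestamp in the partition.
cum_alt = ((col("eventtime").cast("long")
            - first("eventtime").over(w).cast("long")) / 60).cast("long")

df.withColumn("cum", cum_alt).show()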