Spark - cumulative timestamp values in minutes

Time: 2017-08-09 16:47:51

Tags: apache-spark pyspark apache-spark-sql

Basically, I need to compute the cumulative number of minutes across a list of timestamps.

Timestamp               cum  
2017-06-04 02:58:00,    0
2017-06-04 03:02:00,    4
2017-06-04 03:05:00,    7 
2017-06-04 03:10:00,    12 

Here is the idea I am working with:

from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col

windowSpec = W.partitionBy(A["userid"]).orderBy(A["eventtime"])
acumEventTime = F.sum(col("eventtime")).over(windowSpec)
A.select("userid", "eventtime", acumEventTime.alias("acumEventTime"))

I summed the timestamps over a window, which gave me the following values in the acumEventTime field:

acumEventTime 
2.9930904E9,
1.4965452E9,
1.4965452E9,
1.4965452E9,
2.9930904E9

Is there an efficient way to display just the minutes?
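
(Note: the values above are summed epoch seconds. Casting a Spark timestamp to long yields seconds since the Unix epoch, so the window sum adds raw epoch values rather than elapsed minutes. A minimal illustration, reusing the question's DataFrame A:

from pyspark.sql.functions import col

# Casting a timestamp to long yields seconds since the Unix epoch;
# timestamps from June 2017 come out around 1.4965E9, matching the
# magnitudes seen in acumEventTime above.
A.select(col("eventtime").cast("long").alias("epoch_seconds")).show()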

1 Answer:

Answer 0 (score: 1)

Based on the description, I would rather combine lag with sum:

from pyspark.sql.functions import col, coalesce, lag, lit, sum
from pyspark.sql.window import Window

df = (spark.createDataFrame([
    (1, "2017-06-04 02:58:00"),
    (1, "2017-06-04 03:02:00"),
    (1, "2017-06-04 03:05:00"),
    (1, "2017-06-04 03:10:00"),
])
.toDF("userid", "eventtime")
.withColumn("eventtime", col("eventtime").cast("timestamp")))

w = Window.partitionBy("userid").orderBy("eventtime")

# Gap in seconds between each event and the previous one (0 for the
# first event, where lag returns null), summed cumulatively over the
# window and converted to whole minutes.
cum = (sum(coalesce(
    col("eventtime").cast("long") - lag("eventtime", 1).over(w).cast("long"),
    lit(0)
)).over(w) / 60).cast("long")

df.withColumn("cum", cum).show()

+------+-------------------+---+
|userid|          eventtime|cum|
+------+-------------------+---+
|     1|2017-06-04 02:58:00|  0|
|     1|2017-06-04 03:02:00|  4|
|     1|2017-06-04 03:05:00|  7|
|     1|2017-06-04 03:10:00| 12|
+------+-------------------+---+
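
Since the cumulative sum of consecutive gaps telescopes to the offset from each user's first event, an equivalent sketch (assuming the same df and window as above) can skip lag and coalesce and subtract the first event's epoch seconds directly:

from pyspark.sql.functions import col, first
from pyspark.sql.window import Window

w = Window.partitionBy("userid").orderBy("eventtime")

# Minutes elapsed since each user's first event; first() over the
# ordered window picks the earliest eventtime in the partition.
cum_alt = ((col("eventtime").cast("long")
            - first("eventtime").over(w).cast("long")) / 60).cast("long")

df.withColumn("cum", cum_alt).show()

This produces the same output for the sample data; the lag-based version is the one to keep if you later need per-gap logic, such as capping unusually large gaps.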