I have a log file consisting of "Events", "Time", and "UserId".
+--------+----------------+--------+
| Events | Time           | UserId |
+--------+----------------+--------+
| ClickA | 7/6/16 10:00am | userA  |
+--------+----------------+--------+
| ClickB | 7/6/16 12:00am | userA  |
+--------+----------------+--------+
I want to compute the average time between events for each user. How would you solve this? In a traditional programming environment, I would walk through each event for a user, compute the time difference between event n and event n-1, and append that value to an array A, then take the average of the values in A. How can I do this with Spark?
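For reference, the per-user loop I have in mind looks roughly like this in plain Scala (a sketch only, assuming events are already grouped per user and the timestamps have been parsed to epoch seconds):

// Hypothetical illustration of the iterative approach described above.
case class Event(name: String, epochSeconds: Long, userId: String)

def avgGapSeconds(events: Seq[Event]): Option[Double] = {
  val sorted = events.sortBy(_.epochSeconds)
  // Differences between event n and event n-1
  val diffs = sorted.sliding(2).collect {
    case Seq(a, b) => (b.epochSeconds - a.epochSeconds).toDouble
  }.toSeq
  if (diffs.isEmpty) None else Some(diffs.sum / diffs.size)
}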
Answer 0 (score: 1)
Ignoring the date parsing, this looks like a job for a window function followed by a simple aggregation, so roughly you need something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, lag}
// In a standalone app you also need spark.implicits._ for toDF and the $ syntax
import spark.implicits._

val df = Seq(
  ("ClickA", "2016-06-07 10:00:00", "UserA"),
  ("ClickB", "2016-06-07 12:00:00", "UserA")
).toDF("events", "time", "userid").withColumn("time", $"time".cast("timestamp"))

// Window over each user's events, ordered by time
val w = Window.partitionBy("userid").orderBy("time")

// Difference between consecutive events in seconds
val diff = $"time".cast("long") - lag($"time", 1).over(w).cast("long")

df.withColumn("diff", diff).groupBy("userid").agg(avg($"diff"))
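To inspect the result you can add a show() call; the alias below is just for readability and is not required:

df.withColumn("diff", diff)
  .groupBy("userid")
  .agg(avg($"diff").alias("avg_diff_seconds"))
  .show()

// For the two sample rows above this should print roughly:
// +------+----------------+
// |userid|avg_diff_seconds|
// +------+----------------+
// | UserA|          7200.0|
// +------+----------------+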