Hourly aggregation in Scala Spark

Date: 2018-04-18 17:35:34

Tags: scala datetime apache-spark apache-spark-sql spark-dataframe

I am looking for a way to aggregate my data by hour. As a first step, I want to keep only the hour in evtTime. My DataFrame looks like this:

+-------+-----------------------+-----------+
|reqUser|evtTime                |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:23:06.426|1          |
|X166815|2018-01-01 02:20:06.426|2          |
|X166816|2018-01-01 11:25:06.429|5          |
|X166817|2018-02-01 10:23:06.429|1          |
|X166818|2018-01-01 09:23:06.430|3          |
|X166819|2018-01-01 10:15:06.430|8          |
|X166820|2018-08-01 11:00:06.431|20         |
|X166821|2018-03-01 06:23:06.431|7          |
|X166822|2018-01-01 07:23:06.434|2          |
|X166823|2018-01-01 11:23:06.434|1          |
+-------+-----------------------+-----------+

My goal is to get something like this:

+-------+-----------------------+-----------+
|reqUser|evtTime                |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:00:00.000|1          |
|X166815|2018-01-01 02:00:00.000|2          |
|X166816|2018-01-01 11:00:00.000|5          |
|X166817|2018-02-01 10:00:00.000|1          |
|X166818|2018-01-01 09:00:00.000|3          |
|X166819|2018-01-01 10:00:00.000|8          |
|X166820|2018-08-01 11:00:00.000|20         |
|X166821|2018-03-01 06:00:00.000|7          |
|X166822|2018-01-01 07:00:00.000|2          |
|X166823|2018-01-01 11:00:00.000|1          |
+-------+-----------------------+-----------+

I am using Scala 2.10.5 and Spark 1.6.3. My goal is to subsequently group by reqUser and compute the sum of event_count. I tried this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{round, sum}

val new_df = df
  .groupBy($"reqUser",
    Window(col("evtTime"), "1 hour"))
  .agg(sum("event_count") as "aggregate_sum")

Here is the error message:

 Error:(81, 15) org.apache.spark.sql.expressions.Window.type does not take parameters
    Window(col("time"), "1 hour"))

Help? Thanks!

1 Answer:

Answer 0 (score: 0)

In Spark 1.x you can use the date formatting functions instead. The error you got is because org.apache.spark.sql.expressions.Window is a builder for window specifications (used with over), not a function you can call; the window function you were reaching for lives in org.apache.spark.sql.functions and was only added in Spark 2.0.

import org.apache.spark.sql.functions.date_format
import sqlContext.implicits._  // assumed in scope for $ and toDF (automatic in spark-shell)

val df = Seq("2018-01-01 10:15:06.430").toDF("evtTime")

df.select(date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00")).show
+------------------------------------------------------------+
|date_format(CAST(evtTime AS TIMESTAMP), yyyy-MM-dd HH:00:00)|
+------------------------------------------------------------+
|                                         2018-01-01 10:00:00|
+------------------------------------------------------------+
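
Applied to the original DataFrame, a minimal sketch could look like this (assuming the DataFrame is named df as in the question, and that the implicits providing $ are in scope):

import org.apache.spark.sql.functions.{date_format, sum}

// Truncate each timestamp to the start of its hour; the column stays a string.
val hourly = df.withColumn(
  "evtTime",
  date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00"))

// Group by user and truncated hour, then sum event_count within each group.
val aggregated = hourly
  .groupBy($"reqUser", $"evtTime")
  .agg(sum("event_count") as "aggregate_sum")

On Spark 2.0+, where the window function mentioned above is available, the same aggregation would read roughly:

import org.apache.spark.sql.functions.{window, sum}

// Spark 2.0+ only: bucket evtTime into tumbling 1-hour windows.
val aggregated = df
  .groupBy($"reqUser", window($"evtTime".cast("timestamp"), "1 hour"))
  .agg(sum("event_count") as "aggregate_sum")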