I'm looking for a way to aggregate my data by the hour. As a first step I need to truncate evtTime to the hour. My DataFrame looks like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:23:06.426|1 |
|X166815|2018-01-01 02:20:06.426|2 |
|X166816|2018-01-01 11:25:06.429|5 |
|X166817|2018-02-01 10:23:06.429|1 |
|X166818|2018-01-01 09:23:06.430|3 |
|X166819|2018-01-01 10:15:06.430|8 |
|X166820|2018-08-01 11:00:06.431|20 |
|X166821|2018-03-01 06:23:06.431|7 |
|X166822|2018-01-01 07:23:06.434|2 |
|X166823|2018-01-01 11:23:06.434|1 |
+-------+-----------------------+-----------+
My goal is to get something like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:00:00.000|1 |
|X166815|2018-01-01 02:00:00.000|2 |
|X166816|2018-01-01 11:00:00.000|5 |
|X166817|2018-02-01 10:00:00.000|1 |
|X166818|2018-01-01 09:00:00.000|3 |
|X166819|2018-01-01 10:00:00.000|8 |
|X166820|2018-08-01 11:00:00.000|20 |
|X166821|2018-03-01 06:00:00.000|7 |
|X166822|2018-01-01 07:00:00.000|2 |
|X166823|2018-01-01 11:00:00.000|1 |
+-------+-----------------------+-----------+
I'm using Scala 2.10.5 and Spark 1.6.3. Afterwards I want to group by reqUser and compute the sum of event_count. I tried this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{round, sum}
val new_df = df
.groupBy($"reqUser",
Window(col("evtTime"), "1 hour"))
.agg(sum("event_count") as "aggregate_sum")
This is the error message I get:
Error:(81, 15) org.apache.spark.sql.expressions.Window.type does not take parameters
Window(col("time"), "1 hour"))
Any help? Thanks!
Answer 0 (score: 0)
In Spark 1.x you can use the date formatting functions instead. (org.apache.spark.sql.expressions.Window builds analytic window specifications via partitionBy/orderBy, and the time-bucketing window function in org.apache.spark.sql.functions only arrived in Spark 2.0, so neither is what you want on 1.6.3.)
import org.apache.spark.sql.functions.date_format
import sqlContext.implicits._  // spark-shell SQLContext implicits, needed for toDF and the $-notation

val df = Seq("2018-01-01 10:15:06.430").toDF("evtTime")
// Format the timestamp with the minutes and seconds zeroed out, i.e. truncated to the hour.
df.select(date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00")).show
+------------------------------------------------------------+
|date_format(CAST(evtTime AS TIMESTAMP), yyyy-MM-dd HH:00:00)|
+------------------------------------------------------------+
| 2018-01-01 10:00:00|
+------------------------------------------------------------+
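Building on that, a minimal sketch of the full aggregation the question asks for (truncate evtTime to the hour, then group by reqUser and the hour and sum event_count), assuming the original df with reqUser, evtTime and event_count and the $-implicits in scope; the column name evtHour is just an illustrative choice:

import org.apache.spark.sql.functions.{date_format, sum}

// Truncate each timestamp to the start of its hour, then aggregate per user and hour.
val hourly = df
  .withColumn("evtHour", date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00"))
  .groupBy($"reqUser", $"evtHour")
  .agg(sum("event_count") as "aggregate_sum")

hourly.show(false)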