Something about the window function

As in the code below, I group using a window on the column time_test2.
I expected the windows to run from 00:00:00 to 00:00:00 of the next day,
like |2017-09-08 00:00:00|2017-09-09 00:00:00|,
but the results run from 08:00:00 to 08:00:00 of the next day.
Why? And what should I do about it?
Thanks a lot.
Code:
Dataset<Row> df3 = df2.groupBy(
functions.window(df2.col("time_test2"),"1 days"),
df2.col("info.item_id"),
df2.col("info.rt")
).count().selectExpr("window.start", "window.end", "item_id", "rt", "count");
Schema:
root
|-- start: timestamp (nullable = true)
|-- end: timestamp (nullable = true)
|-- item_id: long (nullable = true)
|-- rt: long (nullable = true)
|-- count: long (nullable = false)
Result:
-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-------------------+-------+---+-----+
| start| end|item_id| rt|count|
+-------------------+-------------------+-------+---+-----+
|2017-09-08 08:00:00|2017-09-09 08:00:00| 2| 4| 19|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 2| 3| 19|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 10| 4| 15|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 6| 1| 26|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 1| 3| 25|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 5| 2| 24|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 10| 1| 15|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 8| 2| 15|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 3| 3| 20|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 3| 4| 20|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 9| 4| 15|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 4| 4| 18|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 5| 1| 24|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 1| 4| 25|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 8| 3| 15|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 6| 3| 26|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 2| 1| 19|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 4| 3| 18|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 1| 2| 25|
|2017-09-08 08:00:00|2017-09-09 08:00:00| 5| 3| 24|
+-------------------+-------------------+-------+---+-----+
only showing top 20 rows
Answer 0 (score: 0)
functions.window aligns its windows to the Unix epoch, 1970-01-01 00:00:00 UTC, so in a UTC+8 session timezone (China Standard Time, for example) each one-day window opens at 08:00:00 local time: 2017-09-08 00:00:00 UTC displays as 2017-09-08 08:00:00 in UTC+8. Unless you write a similar function of your own, you can use the following simple workaround in the meantime:
import static org.apache.spark.sql.functions.*;

// Assumes time_test2 is already truncated to the day (e.g. 2017-09-08 00:00:00).
Dataset<Row> df3 = df2.groupBy(
        df2.col("time_test2"),
        df2.col("info.item_id"),
        df2.col("info.rt")
).count().select(
        df2.col("time_test2").as("start"),
        // next calendar day, cast back to a midnight timestamp
        date_add(df2.col("time_test2"), 1).cast("timestamp").as("end"),
        col("item_id"), col("rt"), col("count"));
In Scala the process and logic are the same:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df3 = df2.groupBy("time_test2", "info.item_id", "info.rt").count()
  .select($"time_test2".as("start"),
    date_add($"time_test2", 1).cast("timestamp").as("end"),
    $"item_id", $"rt", $"count")
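For completeness, here is a minimal sketch of keeping the windowed groupBy itself; this is not from the original answer, and it assumes the 08:00 offset comes from running in a UTC+8 session timezone. functions.window takes an optional fourth startTime argument that shifts the window alignment relative to the epoch, so a "16 hours" offset (the same as -8 hours modulo one day; recent Spark versions also accept "-8 hours" directly) lines the one-day windows up with local midnight:

import org.apache.spark.sql.functions;

// Sketch, assuming a UTC+8 session timezone: epoch + 16 hours lands on
// 00:00:00 local time, so the 1-day windows open at local midnight.
Dataset<Row> df3 = df2.groupBy(
        functions.window(df2.col("time_test2"), "1 day", "1 day", "16 hours"),
        df2.col("info.item_id"),
        df2.col("info.rt")
).count().selectExpr("window.start", "window.end", "item_id", "rt", "count");

With this offset the boundaries should display as |2017-09-08 00:00:00|2017-09-09 00:00:00| in the UTC+8 session.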