Something about window functions

Time: 2017-09-09 11:20:45

Tags: apache-spark group-by window


As in the code below, I group by a 1-day window on the column time_test2.

I expected the groups to run from 00:00:00 to 00:00:00 of the next day, like |2017-09-08 00:00:00|2017-09-09 00:00:00|.

But the result runs from 08:00:00 to 08:00:00 of the next day.

Why is that?

How can I get the windows I want?

Thanks a lot.

Code:

Dataset<Row> df3 = df2.groupBy(
            functions.window(df2.col("time_test2"),"1 days"),
            df2.col("info.item_id"),
            df2.col("info.rt")
).count().selectExpr("window.start", "window.end", "item_id", "rt", "count");

Schema:

root
 |-- start: timestamp (nullable = true)
 |-- end: timestamp (nullable = true)
 |-- item_id: long (nullable = true)
 |-- rt: long (nullable = true)
 |-- count: long (nullable = false)

Result:

-------------------------------------------
Batch: 0
-------------------------------------------
+-------------------+-------------------+-------+---+-----+
|              start|                end|item_id| rt|count|
+-------------------+-------------------+-------+---+-----+
|2017-09-08 08:00:00|2017-09-09 08:00:00|      2|  4|   19|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      2|  3|   19|
|2017-09-08 08:00:00|2017-09-09 08:00:00|     10|  4|   15|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      6|  1|   26|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      1|  3|   25|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      5|  2|   24|
|2017-09-08 08:00:00|2017-09-09 08:00:00|     10|  1|   15|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      8|  2|   15|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      3|  3|   20|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      3|  4|   20|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      9|  4|   15|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      4|  4|   18|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      5|  1|   24|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      1|  4|   25|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      8|  3|   15|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      6|  3|   26|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      2|  1|   19|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      4|  3|   18|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      1|  2|   25|
|2017-09-08 08:00:00|2017-09-09 08:00:00|      5|  3|   24|
+-------------------+-------------------+-------+---+-----+
only showing top 20 rows

1 answer:

Answer 0: (score: 0)

The functions.window function adjusts the timestamp boundaries on its own; short of writing a similar function yourself, you cannot control where they fall.
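A plausible explanation for the 08:00:00 boundaries (an assumption, not something confirmed in this post): Spark aligns tumbling windows to the Unix epoch, 1970-01-01 00:00:00 UTC, so a one-day window displayed in a UTC+8 timezone starts at 08:00:00 local time. If that is the cause, the four-argument overload of functions.window takes a startTime offset that can move the boundaries to local midnight. A minimal Java sketch, assuming a UTC+8 timezone:

import static org.apache.spark.sql.functions.*;

// 1-day tumbling windows shifted 16 hours from the UTC epoch, i.e. aligned to
// 00:00:00 local time under a UTC+8 timezone (this offset is an assumption).
Dataset<Row> df3 = df2.groupBy(
        window(df2.col("time_test2"), "1 day", "1 day", "16 hours"),
        df2.col("info.item_id"),
        df2.col("info.rt")
).count().selectExpr("window.start", "window.end", "item_id", "rt", "count");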

In the meantime, you can use the following simple workaround:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;
Dataset<Row> df3 = df2.groupBy(
        df2.col("time_test2"),
        df2.col("info.item_id"),
        df2.col("info.rt")
).count().select(
        df2.col("time_test2").as("start"),
        unix_timestamp(date_add(df2.col("time_test2"), 1), "yyyy-MM-dd HH:mm:ss")
                .cast(DataTypes.TimestampType).as("end"),
        col("item_id"), col("rt"), col("count"));
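As written, start is the raw time_test2 value and end should come out as 00:00:00 of the following calendar day: date_add moves the date forward by one day, and unix_timestamp(...).cast(DataTypes.TimestampType) converts that date back into a timestamp (the pattern string only matters when the input column is a string).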

In Scala

The process and logic are the same:

import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType
val df3 = df2.groupBy("time_test2", "info.item_id", "info.rt").count()
  .select($"time_test2".as("start"),
    unix_timestamp(date_add($"time_test2", 1), "yyyy-MM-dd HH:mm:ss").cast(TimestampType).as("end"),
    $"item_id", $"rt", $"count")