How to group records within a specific time interval using Spark Scala or SQL?

Asked: 2019-03-28 16:56:19

Tags: sql scala apache-spark

I want to group records in Scala only when they have the same ID and their times are within one minute of each other.

Conceptually I was thinking of something like this, but I'm not quite sure:

HAVING a.ID = b.ID AND a.time + 30 sec > b.time AND a.time - 30 sec < b.time
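
For illustration, here is a rough DataFrame-API sketch of that self-join condition (the alias names are made up, `df` refers to a DataFrame with the ID, volume and Time columns shown below, and the 30-second tolerance simply mirrors the pseudocode above; it only pairs matching rows, it does not yet produce the grouped sums):

import org.apache.spark.sql.functions.{abs, col, unix_timestamp}

// Pair rows that share an ID and whose timestamps lie within 30 seconds of each other.
val paired = df.as("a").join(df.as("b"),
  col("a.ID") === col("b.ID") &&
  abs(unix_timestamp(col("a.Time")) - unix_timestamp(col("b.Time"))) <= 30)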




| ID         |     volume  |           Time             |
|:-----------|------------:|:--------------------------:|
| 1          |      10     |    2019-02-17T12:00:34Z    |
| 2          |      20     |    2019-02-17T11:10:46Z    |
| 3          |      30     |    2019-02-17T13:23:34Z    |
| 1          |      40     |    2019-02-17T12:01:02Z    |
| 2          |      50     |    2019-02-17T11:10:30Z    |
| 1          |      60     |    2019-02-17T12:01:57Z    |

The input above should be grouped into this:

| ID         |     volume  | 
|:-----------|------------:|
| 1          |      50     |   // (10+40)
| 2          |      70     |   // (20+50)
| 3          |      30     |


df.groupBy($"ID", window($"Time", "1 minutes")).sum("volume")

The code above is one solution, but the windows are always rounded to the minute boundary.

For example, 2019-02-17T12:00:45Z falls into the range

2019-02-17T12:00:00Z TO 2019-02-17T12:01:00Z.

Instead, I am looking for this: 2019-02-17T11:45:00Z TO 2019-02-17T12:01:45Z.

Is there a way to do this?

1 Answer:

Answer 0 (score: 1)

org.apache.spark.sql.functions provides the following overloaded window functions.

1. window(timeColumn: Column, windowDuration: String): Generates tumbling time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05).

The windows will look like this (a short usage sketch follows the listing):

  {{{
    09:00:00-09:01:00
    09:01:00-09:02:00
    09:02:00-09:03:00 ...
  }}}
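
A minimal usage sketch of this overload, assuming the same df, spark.implicits._ and functions._ imports as in the full example further down; every row falls into exactly one minute-aligned window:

// Tumbling one-minute windows, aligned to the start of each minute.
df.groupBy($"ID", window($"Time", "1 minute"))
  .sum("Volume")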

2. window(timeColumn: Column, windowDuration: String, slideDuration: String): Bucketizes rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). The slideDuration parameter specifies the sliding interval of the window, e.g. 1 minute; a new window is generated every slideDuration, which must be less than or equal to windowDuration.

The windows will look like this (again with a sketch after the listing):

{{{
  09:00:00-09:01:00
  09:00:10-09:01:10
  09:00:20-09:01:20 ...
}}}
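
A sketch of this sliding variant (same assumed df and imports); because the slide is shorter than the window, the windows overlap and one row can be counted in several of them:

// One-minute windows that start every 10 seconds, so each row can fall
// into up to six overlapping windows.
df.groupBy($"ID", window($"Time", "1 minute", "10 seconds"))
  .sum("Volume")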

3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Bucketizes rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). The startTime parameter is the offset, relative to 1970-01-01 00:00:00 UTC, with which to shift the start of the window intervals.

The windows will look like this:

{{{
  09:00:05-09:01:05
  09:00:15-09:01:15
  09:00:25-09:01:25 ...
}}}

For example, to get hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15, ..., provide startTime as 15 minutes. This is the overloaded window function that fits your requirement; a rough sketch of that hourly example follows.
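
The sketch below assumes the same df, column names and imports as the full example that follows; the window duration, slide, and 15-minute offset come straight from the hourly example above.

// Hourly, non-overlapping windows shifted to start 15 minutes past the hour,
// e.g. 12:15-13:15, 13:15-14:15, ...
df.groupBy(window($"Time", "1 hour", "1 hour", "15 minutes"))
  .sum("Volume")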

Please find the full working code below.

import org.apache.spark.sql.SparkSession

object SparkWindowTest extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("File_Streaming")
    .getOrCreate()

  import spark.implicits._
  import org.apache.spark.sql.functions._

  //Prepare Test Data
  val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
    (3, 30, "2019-02-17 13:23:34"),(2, 50, "2019-02-17 11:10:30"),
    (1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
    .toDF("ID", "Volume", "TimeString")

  df.show()
  df.printSchema()

+---+------+-------------------+
| ID|Volume|         TimeString|
+---+------+-------------------+
|  1|    10|2019-02-17 12:00:49|
|  2|    20|2019-02-17 11:10:46|
|  3|    30|2019-02-17 13:23:34|
|  2|    50|2019-02-17 11:10:30|
|  1|    40|2019-02-17 12:01:02|
|  1|    60|2019-02-17 12:01:57|
+---+------+-------------------+

root
 |-- ID: integer (nullable = false)
 |-- Volume: integer (nullable = false)
 |-- TimeString: string (nullable = true)

  //Converted String Timestamp into Timestamp
  val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))

  //Dropped String Timestamp from DF
  val modifiedDF1 = modifiedDF.drop("TimeString")

  modifiedDF.show(false)
  modifiedDF.printSchema()

+---+------+-------------------+-------------------+
|ID |Volume|TimeString         |Time               |
+---+------+-------------------+-------------------+
|1  |10    |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2  |20    |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3  |30    |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2  |50    |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1  |40    |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1  |60    |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+

root
 |-- ID: integer (nullable = false)
 |-- Volume: integer (nullable = false)
 |-- TimeString: string (nullable = true)
 |-- Time: timestamp (nullable = true)

  modifiedDF1.show(false)
  modifiedDF1.printSchema()

+---+------+-------------------+
|ID |Volume|Time               |
+---+------+-------------------+
|1  |10    |2019-02-17 12:00:49|
|2  |20    |2019-02-17 11:10:46|
|3  |30    |2019-02-17 13:23:34|
|2  |50    |2019-02-17 11:10:30|
|1  |40    |2019-02-17 12:01:02|
|1  |60    |2019-02-17 12:01:57|
+---+------+-------------------+

root
 |-- ID: integer (nullable = false)
 |-- Volume: integer (nullable = false)
 |-- Time: timestamp (nullable = true)

  //Main logic: one-minute windows sliding every minute (i.e. non-overlapping),
  //shifted by a 45-second startTime so each window runs from hh:mm:45 to the next hh:mm:45.
  val modifiedDF2 = modifiedDF1.groupBy($"ID", window($"Time", "1 minutes", "1 minutes", "45 seconds")).sum("Volume")

  //Renamed all columns of DF.
  val newNames = Seq("ID", "WINDOW", "VOLUME")
  val finalDF = modifiedDF2.toDF(newNames: _*)

  finalDF.show(false)

+---+---------------------------------------------+------+
|ID |WINDOW                                       |VOLUME|
+---+---------------------------------------------+------+
|2  |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50    |
|1  |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60    |
|1  |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50    |
|3  |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30    |
|2  |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20    |
+---+---------------------------------------------+------+

}