Spark Structured Streaming: window and grouping operations

Time: 2017-12-21 22:43:39

Tags: scala apache-spark spark-structured-streaming

I have successfully run a grouping operation on my streaming DataFrame to calculate the average number of people in cars per track.

val carSchema =
    new StructType()
    .add("trackId", StringType)
    .add("carId", StringType)
    .add("peopleCount", StringType)
    .add("time", StringType)

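In this problem there are several race tracks (3 in my case), each with its own unique "trackId". Several cars may be driving on each track, and each car has its own "carId". We also use "peopleCount" to track how many people are in a car. The "time" field is the time at which a given car started its race.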

Since we want to compute the average number of people in cars, we convert "peopleCount" from string to int:

val dataFrame = 
    inputStream.selectExpr("CAST (content AS STRING) AS JSON")
    .select(from_json($"json", schema = carSchema)
    .as("carData"))
    .select("carData.*")
    .withColumn("peopleCount", toInt($"peopleCount"))

dataFrame.printSchema
root
 |-- trackId: string (nullable = true)
 |-- carId: string (nullable = true)
 |-- peopleCount: integer (nullable = true)
 |-- time: string (nullable = true)
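
Note that `toInt` is not defined anywhere in the snippet above; it is presumably a UDF declared elsewhere. As a minimal sketch, assuming the built-in column cast is all that is needed, the same conversion could be written as below. `CastSketch`, `withIntPeopleCount`, and `parsed` are names made up for this illustration and do not come from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

object CastSketch {
  // `parsed` stands for the DataFrame obtained right after `.select("carData.*")`,
  // where every column is still a string (assumption based on carSchema).
  def withIntPeopleCount(parsed: DataFrame): DataFrame =
    // Replace the string column with its integer cast; values that are not
    // valid integers become null instead of raising an error.
    parsed.withColumn("peopleCount", col("peopleCount").cast(IntegerType))
}
```

This should yield the same schema as printed above, with `peopleCount` as an integer and the other columns left as strings.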

For reference, the data looks like this:

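+------------------------------------+------------------------------------+------------+----------------------------+
|trackId                             |carId                               |peopleCount |time                        |
+------------------------------------+------------------------------------+------------+----------------------------+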
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|2           |2017-12-20T23:04:14.7900000Z|
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|1           |2017-12-20T23:23:34.5510000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|2           |2017-12-20T19:27:57.7710000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|3           |2017-12-19T19:29:32.9790000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|4           |2017-12-19T19:31:12.6600000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|1           |2017-12-19T19:32:52.7190000Z|
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|2           |2017-12-19T23:45:06.4140000Z|
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|3           |2017-12-20T21:09:03.7440000Z|
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|2f16b0f9-164c-4e3d-a5c9-f672bcf87197|3           |2017-12-19T21:25:06.2340000Z|
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|2f16b0f9-164c-4e3d-a5c9-f672bcf87197|3           |2017-12-20T18:10:03.6540000Z|
<...more data...>

Now, since we want the average number of people in cars for each track:

val avgPeopleInCars = dataFrame.groupBy("trackId").avg("peopleCount")

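This returns the correct averages. There are 3 tracks, and I receive 3 rows, one per track, each with the average number of people in cars on that track: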

-------------------------------------------
Batch: 0
-------------------------------------------
+------------------------------------+-------------------+
|trackId                             |avg(peopleCount)  |
+------------------------------------+-------------------+
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|3.5               |
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.0               |
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|1.0               |
+------------------------------------+-------------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------+-------------------+
|trackId                             |avg(peopleCount)  |
+------------------------------------+-------------------+
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.5               |
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.2               |
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|3.0               |
+------------------------------------+-------------------+

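Now I am trying to understand how to reshape the output to use a window of 3 minutes with a sliding interval of 1 minute, while still doing the same calculation: the average number of people in cars per track. My first attempt was: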

val windowedData = 
    dataFrame
    .groupBy(window($"time", "3 minutes", "1 minute"), $"trackId")
    .avg("peopleCount")

windowedData.printSchema
root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = true)
 |    |-- end: timestamp (nullable = true)
 |-- trackId: string (nullable = true)
 |-- avg(peopleCount): double (nullable = true)

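However, this does not look right. I expected the same kind of output as in the previous step: the result for each window should contain 3 rows, one per track.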

-------------------------------------------
Batch: 0
-------------------------------------------
+---------------------------------------------+------------------------------------+-------------------+
|window                                       |trackId                             |avg(peopleCount)  |
+---------------------------------------------+------------------------------------+-------------------+
|[2017-12-18 23:02:00.0,2017-12-18 23:05:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0               |
|[2017-12-18 23:03:00.0,2017-12-18 23:06:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0               |
|[2017-12-18 23:04:00.0,2017-12-18 23:07:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0               |
+---------------------------------------------+------------------------------------+-------------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------------------------+------------------------------------+-------------------+
|window                                       |trackId                             |avg(peopleCount)  |
+---------------------------------------------+------------------------------------+-------------------+
|[2017-12-18 23:02:00.0,2017-12-18 23:05:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0               |
|[2017-12-18 23:03:00.0,2017-12-18 23:06:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0               |
|[2017-12-18 23:04:00.0,2017-12-18 23:07:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0               |
|[2017-12-21 18:55:00.0,2017-12-21 18:58:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|2.0               |
|[2017-12-21 18:56:00.0,2017-12-21 18:59:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|2.0               |
|[2017-12-21 18:57:00.0,2017-12-21 19:00:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|2.0               |
|[2017-12-21 18:59:00.0,2017-12-21 19:02:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|3.0               |
|[2017-12-21 19:00:00.0,2017-12-21 19:03:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.0               |
|[2017-12-21 19:01:00.0,2017-12-21 19:04:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.0               |
|[2017-12-21 19:02:00.0,2017-12-21 19:05:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.5               |
+---------------------------------------------+------------------------------------+-------------------+

1 Answer:

Answer 0 (score: 0)

If you look closely, you will see that it is not returning duplicate entries: it gives you exactly one entry per "window".
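
To see why a single event shows up in several rows, it helps to work out which sliding windows contain a given timestamp. The following is a plain-Scala sketch of that arithmetic, not Spark's actual implementation. `SlidingWindows`, `Window`, and `windowsFor` are hypothetical names, and the sketch assumes windows are aligned to the epoch and include their start but exclude their end.

```scala
import java.time.{Duration, Instant}

object SlidingWindows {
  // A half-open interval [start, end).
  final case class Window(start: Instant, end: Instant)

  // Returns every window of length `size`, starting on a multiple of `slide`,
  // that contains `eventTime`, ordered from earliest to latest start.
  def windowsFor(eventTime: Instant, size: Duration, slide: Duration): List[Window] = {
    require(!slide.isZero && !slide.isNegative, "slide must be positive")
    require(size.compareTo(slide) >= 0, "size must be at least as long as slide")
    val t = eventTime.toEpochMilli
    val sizeMs = size.toMillis
    val slideMs = slide.toMillis
    // Latest window start that is not after the event.
    val latestStart = t - java.lang.Math.floorMod(t, slideMs)
    Iterator
      .iterate(latestStart)(_ - slideMs)
      // A window still contains the event while start + size > t.
      .takeWhile(start => start > t - sizeMs)
      .map(start => Window(Instant.ofEpochMilli(start), Instant.ofEpochMilli(start + sizeMs)))
      .toList
      .reverse
  }
}
```

For an event at 23:04:14 with a 3-minute window sliding every minute, `windowsFor` yields the windows starting at 23:02, 23:03, and 23:04. That is the same pattern as the three rows for one track in Batch 0: one event, three overlapping windows.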

I am not an expert on windowing, but I would guess that if you add a suitable watermark, you will only see the latest windows.
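
As a rough sketch of what adding a watermark could look like: `withWatermark` needs a timestamp column, so the string `time` column is cast first. The 10-minute delay and the names `WatermarkSketch`, `windowedAverages`, `startConsoleQuery`, and `avgPeopleCount` are assumptions for illustration, not values from the question. As far as I understand, in append output mode a window is only emitted once the watermark has passed its end, whereas complete mode reprints every window in every batch, which may be why Batch 1 in the question repeats the rows of Batch 0.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, window}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.types.TimestampType

object WatermarkSketch {
  // `events` stands for the DataFrame from the question, with a string
  // `time` column and an integer `peopleCount` column.
  def windowedAverages(events: DataFrame): DataFrame =
    events
      // Watermarks can only be defined on a timestamp column.
      .withColumn("time", col("time").cast(TimestampType))
      // Assumed tolerance for late data; tune this to the real delay.
      .withWatermark("time", "10 minutes")
      // Same 3-minute window sliding every minute as in the question.
      .groupBy(window(col("time"), "3 minutes", "1 minute"), col("trackId"))
      .agg(avg(col("peopleCount")).as("avgPeopleCount"))

  // Only meaningful when `windowed` comes from a streaming source.
  def startConsoleQuery(windowed: DataFrame): StreamingQuery =
    windowed.writeStream
      .outputMode("append") // emit each window once, after it is finalized
      .format("console")
      .option("truncate", "false")
      .start()
}
```

Even with a watermark, each batch can still contain several windows per track, because every event belongs to three overlapping windows; the watermark only controls when old windows stop being updated and, in append mode, when they are emitted.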