I have successfully run a groupBy on my streaming DataFrame to compute the average number of people per car on each track.
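import org.apache.spark.sql.types.{StringType, StructType}

// Every field arrives as a string in the JSON payload.
val carSchema =
  new StructType()
    .add("trackId", StringType)
    .add("carId", StringType)
    .add("peopleCount", StringType)
    .add("time", StringType)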
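In this problem there are several race tracks (3 in my case). Each of them has its own unique "trackId". On each track there may be several cars running, and each car has its own "carId". We also use "peopleCount" to track how many people are in the car. The "time" field corresponds to the start time of the race for the given car.
Because we want to compute the average number of people per car, we convert "peopleCount" from string to int: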
val dataFrame =
  inputStream
    .selectExpr("CAST (content AS STRING) AS JSON")
    .select(from_json($"json", schema = carSchema).as("carData"))
    .select("carData.*")
    // toInt is assumed to be a UDF defined elsewhere (not shown) that parses the string into an Int
    .withColumn("peopleCount", toInt($"peopleCount"))
dataFrame.printSchema
root
|-- trackId: string (nullable = true)
|-- carId: string (nullable = true)
|-- peopleCount: integer (nullable = true)
|-- time: string (nullable = true)
For reference, the data looks like this:
+------------------------------------+------------------------------------+-----------+----------------------------+
|trackId                             |carId                               |peopleCount|time                        |
+------------------------------------+------------------------------------+-----------+----------------------------+
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|2          |2017-12-20T23:04:14.7900000Z|
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|1          |2017-12-20T23:23:34.5510000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|2          |2017-12-20T19:27:57.7710000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|3          |2017-12-19T19:29:32.9790000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|4          |2017-12-19T19:31:12.6600000Z|
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|984ec5d7-f4a6-422b-aeb6-d130efaf0001|1          |2017-12-19T19:32:52.7190000Z|
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|2          |2017-12-19T23:45:06.4140000Z|
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|a85f22a3-5f57-4bde-ad00-5eeb303a9859|3          |2017-12-20T21:09:03.7440000Z|
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|2f16b0f9-164c-4e3d-a5c9-f672bcf87197|3          |2017-12-19T21:25:06.2340000Z|
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|2f16b0f9-164c-4e3d-a5c9-f672bcf87197|3          |2017-12-20T18:10:03.6540000Z|
<...more data...>
Now, since we want to find the average number of people per car for each track:
val avgPeopleInCars = dataFrame.groupBy("trackId").avg("peopleCount")
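This returns the correct averages. There are 3 tracks, and I get 3 rows back, one per track, each with the average number of people per car on that track: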
-------------------------------------------
Batch: 0
-------------------------------------------
+------------------------------------+-------------------+
|trackId |avg(peopleCount) |
+------------------------------------+-------------------+
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|3.5 |
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.0 |
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|1.0 |
+------------------------------------+-------------------+
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------+-------------------+
|trackId |avg(peopleCount) |
+------------------------------------+-------------------+
|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.5 |
|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.2 |
|52f4c09c-7b9d-45d9-96ac-e0fe49458962|3.0 |
+------------------------------------+-------------------+
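Currently, I am trying to understand how to reshape the output so that it uses a window of 3 minutes with a slide interval of 1 minute, while still doing the same calculation: the average number of people per car on each track. My first attempt was: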
val windowedData =
  dataFrame
    .groupBy(window($"time", "3 minutes", "1 minute"), $"trackId")
    .avg("peopleCount")
windowedData.printSchema
root
|-- window: struct (nullable = true)
| |-- start: timestamp (nullable = true)
| |-- end: timestamp (nullable = true)
|-- trackId: string (nullable = true)
|-- avg(peopleCount): double (nullable = true)
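However, this does not look right. I expected to get the same kind of output as in the previous step: for each window, the output dataset should contain 3 rows, one per track.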
-------------------------------------------
Batch: 0
-------------------------------------------
+---------------------------------------------+------------------------------------+-------------------+
|window |trackId |avg(peopleCount) |
+---------------------------------------------+------------------------------------+-------------------+
|[2017-12-18 23:02:00.0,2017-12-18 23:05:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0 |
|[2017-12-18 23:03:00.0,2017-12-18 23:06:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0 |
|[2017-12-18 23:04:00.0,2017-12-18 23:07:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0 |
+---------------------------------------------+------------------------------------+-------------------+
-------------------------------------------
Batch: 1
-------------------------------------------
+---------------------------------------------+------------------------------------+-------------------+
|window |trackId |avg(peopleCount) |
+---------------------------------------------+------------------------------------+-------------------+
|[2017-12-18 23:02:00.0,2017-12-18 23:05:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0 |
|[2017-12-18 23:03:00.0,2017-12-18 23:06:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0 |
|[2017-12-18 23:04:00.0,2017-12-18 23:07:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|1.0 |
|[2017-12-21 18:55:00.0,2017-12-21 18:58:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|2.0 |
|[2017-12-21 18:56:00.0,2017-12-21 18:59:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|2.0 |
|[2017-12-21 18:57:00.0,2017-12-21 19:00:00.0]|4ccfeb47-c76f-43f4-87bd-7a5777f78e7a|2.0 |
|[2017-12-21 18:59:00.0,2017-12-21 19:02:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|3.0 |
|[2017-12-21 19:00:00.0,2017-12-21 19:03:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.0 |
|[2017-12-21 19:01:00.0,2017-12-21 19:04:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.0 |
|[2017-12-21 19:02:00.0,2017-12-21 19:05:00.0]|f261a42d-a7ac-4a2d-81b4-c5c7189a2b66|2.5 |
+---------------------------------------------+------------------------------------+-------------------+
Answer 0 (score: 0)
If you look closely, you will see that it is not returning multiple entries: each "window" gives you exactly one entry per track.
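The rows come in groups of three because a 3 minute window that slides every 1 minute overlaps its neighbours, so every event falls into 3 windows at once. The following is a minimal sketch in plain Scala (no Spark) of that assignment, assuming windows are aligned to the epoch; `SlidingWindows`, `Window`, and `slidingWindows` are hypothetical names used only for illustration, and the timestamp in the usage note is made up.

```scala
import java.time.{Duration, Instant}

object SlidingWindows {
  // Half-open interval [start, end), like the window struct in the output above.
  final case class Window(start: Instant, end: Instant)

  // Returns every sliding window that contains `ts`, earliest window first.
  // Assumption: window starts are aligned to multiples of `slide` since the epoch.
  def slidingWindows(ts: Instant, size: Duration, slide: Duration): Seq[Window] = {
    val t = ts.toEpochMilli
    val sizeMs = size.toMillis
    val slideMs = slide.toMillis
    // Start of the most recent window that begins at or before ts.
    val lastStart = Math.floorDiv(t, slideMs) * slideMs
    Iterator
      .iterate(lastStart)(_ - slideMs)          // walk back one slide at a time
      .takeWhile(start => start + sizeMs > t)   // keep windows that still cover ts
      .map(start => Window(Instant.ofEpochMilli(start), Instant.ofEpochMilli(start + sizeMs)))
      .toSeq
      .reverse
  }
}
```

For example, `SlidingWindows.slidingWindows(Instant.parse("2017-12-21T19:01:30Z"), Duration.ofMinutes(3), Duration.ofMinutes(1))` yields the three windows starting at 18:59, 19:00, and 19:01, which is the same pattern as the consecutive rows for one track in the question's output.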
I am not a windowing expert, but my guess is that if you add a suitable watermark, you will only see the latest windows.
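A rough sketch of what adding a watermark could look like, under a few assumptions that are not in the question: the "time" string is cast to a timestamp column called `eventTime` (a name made up here), the lateness threshold of "10 minutes" is arbitrary, and the result column is renamed to `avgPeople`. I have not run this against the asker's stream.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, window}
import org.apache.spark.sql.streaming.StreamingQuery

object WindowedAverages {
  // Windowed average per track, with a watermark on the event-time column.
  def windowedAverages(cars: DataFrame): DataFrame =
    cars
      // withWatermark needs a timestamp column, so cast the string first (assumed column name)
      .withColumn("eventTime", col("time").cast("timestamp"))
      // assumed threshold: events more than 10 minutes behind the latest seen time may be dropped
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "3 minutes", "1 minute"), col("trackId"))
      .agg(avg("peopleCount").as("avgPeople"))

  // Console sink in append mode: a window is emitted once, after the watermark passes its end.
  def startConsoleQuery(averages: DataFrame): StreamingQuery =
    averages.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", "false")
      .start()
}
```

With `outputMode("append")` each (window, trackId) row is written a single time once the window is final, whereas `outputMode("update")` writes only the rows that changed in the current trigger. The output in the question, where Batch 1 repeats the rows of Batch 0, looks like complete mode, which reprints the whole result table on every trigger.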