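Window function to convert n rows of a column into a single row

Time: 2019-11-25 17:56:17

Tags: scala dataframe apache-spark apache-spark-sql

I need to collapse 25 rows of a column into a single row of a DataFrame.

The input data looks like this.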

+------+----------+---------------------------------------+
|ID    |TIME      |SGNL                                   |
+------+----------+---------------------------------------+
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"}|
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"}|
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"}|
|00001 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360298,"SV":"0.0"}|
+------+----------+---------------------------------------+

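I need to apply a window function here that takes 25 SGNL values for a given ID, ordered by TIME, and puts them into a single row. I have already defined a window that partitions the DataFrame by ID and orders it by TIME. Now I need to get the data in the following form.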

+------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DTC   |DTCTS     |SGNL                                                                                                                                                           |
+------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360298,"SV":"0.0"}                                        |
+------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

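As shown above, the first 25 rows of the SGNL column in each partition should be merged into a single row. Is there a way to do this?

2 answers:

Answer 0: (score: 0)

Updated answer (2)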

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.IntegerType

// Sample data: 3 rows for ID 00002 and 51 rows for ID 00001,
// numbered within each ID in TIME order
val df = (Seq(
  ("00002", 1574360355, """{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002", 1574360355, """{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002", 1574360355, """{"SN":"Acc","ST":1574360298,"SV":"0.0"}""")
) ++ (1 to 51).map { _ => ("00001", 1574360355, """{"SN":"Acc","ST":1574360296,"SV":"0.0"}""") })
  .toDF("ID", "TIME", "SGNL")
  .withColumn("rownum", row_number().over(Window.partitionBy($"ID").orderBy($"TIME")))

// Rows 1-25 fall into bucket 0, rows 26-50 into bucket 1, and so on
df.groupBy($"ID", (($"rownum" - 1) / 25).cast(IntegerType).as("by25"))
  .agg(min($"TIME"), collect_list($"SGNL"))
  .drop("by25")
  .toDF("DTC", "DTCTS", "SGNL")
  .show(false)

+-----+----+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DTC  |by25|DTCTS     |SGNL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-----+----+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|00001|0   |1574360355|[{"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}]|
|00001|1   |1574360355|[{"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}]|
|00001|2   |1574360355|[{"SN":"Acc","ST":1574360296,"SV":"0.0"}]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
|00002|0   |1574360355|[{"SN":"Acc","ST":1574360297,"SV":"0.0"}, {"SN":"Acc","ST":1574360297,"SV":"0.0"}, {"SN":"Acc","ST":1574360298,"SV":"0.0"}]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+-----+----+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

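Note that the result is now an array. The output shown above also includes the helper column by25, which the code as written drops before renaming the columns.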
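To make the bucketing arithmetic concrete, here is a small plain-Scala sketch (no Spark involved) that applies the same (rownum - 1) / 25 rule to the row numbers 1 to 51. The names rowNumbers, batchSize and bucketSizes are illustrative and do not come from the answer.

```scala
// Row numbers 1..51, as row_number() would assign them within one ID partition.
val rowNumbers = (1 to 51).toList

// Same rule as (($"rownum" - 1) / 25).cast(IntegerType): integer division by the batch size.
val batchSize = 25
val bucketSizes: Map[Int, Int] =
  rowNumbers
    .groupBy(n => (n - 1) / batchSize)
    .map { case (bucket, ns) => bucket -> ns.size }

// prints List((0,25), (1,25), (2,1))
println(bucketSizes.toList.sortBy(_._1))
```

This matches the three rows for ID 00001 in the output above (25, 25 and 1 signals). In Spark the / operator on columns yields a double, which is why the answer casts to IntegerType; plain Scala Int division truncates directly.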

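Answer 1: (score: 0)

You can use two window functions to achieve what you need:

import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(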
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}"""),
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}"""),
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}"""),
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360298,"SV":"0.0"}""")
).toDF("ID", "TIME", "SGNL")

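val win = Window.partitionBy($"ID").orderBy($"TIME")
// With an orderBy and no explicit frame, a window only runs up to the current row
// (and its TIME peers), so row 1 would miss later rows. Use a full-partition frame.
val fullWin = win.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df
  .withColumn("rnb", row_number().over(win))
  .where($"rnb" <= 25) // limit to first 25 rows
  .withColumn("SGNL", collect_list($"SGNL").over(fullWin))
  .where($"rnb" === 1) // collapse to 1 record per ID
  .withColumn("SGNL", concat_ws(",", $"SGNL")) // convert array to single string
  .drop("rnb")
  .show(false) // false: do not truncate the long SGNL string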

This gives:

+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID   |TIME      |SGNL                                                                                                                                                           |
+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|00001|1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002|1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360298,"SV":"0.0"}                                        |
+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

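Note that TIME now holds the minimum TIME of the aggregated records. If you want the maximum TIME instead, you need another window function to find the maximum rnb and then filter on that.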
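A minimal sketch of the maximum-TIME variant, assuming the same ID / TIME / SGNL columns and a local SparkSession. Instead of filtering on the largest rnb, it reads max(TIME) over a frame spanning the whole partition after the 25-row cut, which is a swapped-in alternative rather than the approach described above. The sample values and the names ordered, whole and result are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("max-time-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample with distinct TIME values, so that min and max differ.
val df = Seq(
  ("00001", 1574360355, "a"),
  ("00001", 1574360356, "b"),
  ("00001", 1574360357, "c"),
  ("00002", 1574360355, "d")
).toDF("ID", "TIME", "SGNL")

val ordered = Window.partitionBy($"ID").orderBy($"TIME")
// Frame covering every row of the partition, not just the rows up to the current one.
val whole = ordered.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val result = df
  .withColumn("rnb", row_number().over(ordered))
  .where($"rnb" <= 25)                            // keep the first 25 rows per ID
  .withColumn("SGNL", concat_ws(",", collect_list($"SGNL").over(whole)))
  .withColumn("TIME", max($"TIME").over(whole))   // largest TIME among the kept rows
  .where($"rnb" === 1)                            // one record per ID
  .drop("rnb")

result.show(false)
```

Because the 25-row filter runs before the second set of window expressions, the maximum is taken only over the rows that end up in the merged SGNL string.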