I need to window 25 rows of a column into a single row of a DataFrame.
The input data looks like this.
+------+----------+---------------------------------------+
|ID |TIME |SGNL |
+------+----------+---------------------------------------+
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"}|
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"}|
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"}|
|00001 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360298,"SV":"0.0"}|
+------+----------+---------------------------------------+
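I have to apply a window function here that takes 25 SGNL values for a given ID, ordered by time, and puts them in a single row. I have already built the window that partitions the DataFrame by ID and orders it by TIME. Now I need to get data like the following.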
+------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|DTC |DTCTS |SGNL |
+------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|00001 |1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002 |1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360298,"SV":"0.0"} |
+------+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
As shown above, the first 25 rows of the SGNL column in a given partition should be merged into one row. Is there any way to do this?
Answer 0 (score: 0)
Updated answer (2):
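import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

// Sample data: 3 rows for ID 00002 and 51 rows for ID 00001
val df = (Seq(
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360298,"SV":"0.0"}""")
) ++ (1 to 51).map{_ => ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}""")})
  .toDF("ID", "TIME", "SGNL")
  // number the rows of each ID in TIME order
  .withColumn("rownum", row_number().over(Window.partitionBy($"ID").orderBy($"TIME")))

// Integer division maps row numbers 1-25 to bucket 0, 26-50 to bucket 1, and so on
df.groupBy($"ID", (($"rownum"-1)/25).cast(IntegerType).as("by25"))
  .agg(min($"TIME"), collect_list($"SGNL"))
  .drop("by25")
  .toDF("DTC","DTCTS","SGNL")
  .show(false)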

|DTC |by25|DTCTS |SGNL |

|00001|0 |1574360355|[{"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}]|
|00001|1 |1574360355|[{"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}, {"SN":"Acc","ST":1574360296,"SV":"0.0"}]|
|00001|2 |1574360355|[{"SN":"Acc","ST":1574360296,"SV":"0.0"}] |
|00002|0 |1574360355|[{"SN":"Acc","ST":1574360297,"SV":"0.0"}, {"SN":"Acc","ST":1574360297,"SV":"0.0"}, {"SN":"Acc","ST":1574360298,"SV":"0.0"}] |

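Note that the SGNL column is now an array rather than a single string. The output above also shows the by25 bucket index, which the drop("by25") call in the code removes.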
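The by25 grouping key in the code above relies on integer division of a 1-based row number. The following is a minimal sketch of that arithmetic on plain Scala collections, with no Spark involved; the object name BucketSketch and the helpers bucketOf and bucketSizes are hypothetical names introduced only for illustration.

```scala
object BucketSketch {
  // Hypothetical helper: maps a 1-based row number to a 0-based bucket index,
  // mirroring (rownum - 1) / 25 followed by a cast to an integer.
  def bucketOf(rowNum: Int, size: Int = 25): Int = (rowNum - 1) / size

  // Hypothetical helper: sizes of the buckets for rows numbered 1..n,
  // listed in bucket order.
  def bucketSizes(n: Int, size: Int = 25): Seq[Int] =
    (1 to n)
      .groupBy(rowNum => bucketOf(rowNum, size))
      .toSeq
      .sortBy { case (bucket, _) => bucket }
      .map { case (_, rows) => rows.size }

  def main(args: Array[String]): Unit = {
    // 51 rows, as for ID 00001 in the sample data
    println(bucketSizes(51).mkString(", "))
    // 3 rows, as for ID 00002 in the sample data
    println(bucketSizes(3).mkString(", "))
  }
}
```

Subtracting 1 before dividing is what keeps row 25 in the first bucket; without it, row 25 would already fall into bucket 1 and the first bucket would hold only 24 rows.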
Answer 1 (score: 0)
You can use two window functions to achieve what you need:
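import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}"""),
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}"""),
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360296,"SV":"0.0"}"""),
  ("00001",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360297,"SV":"0.0"}"""),
  ("00002",1574360355,"""{"SN":"Acc","ST":1574360298,"SV":"0.0"}""")
).toDF("ID", "TIME", "SGNL")

// Window used to number the rows of each ID in TIME order
val win = Window.partitionBy($"ID").orderBy($"TIME")
// Same ordering with a frame spanning the whole partition. The default frame of an
// ordered window ends at the current row and its TIME peers, so without this the
// first row would miss every row that has a later TIME.
val fullWin = win.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df
  .withColumn("rnb", row_number().over(win))
  .where($"rnb" <= 25) // limit to first 25 rows
  .withColumn("SGNL", collect_list($"SGNL").over(fullWin))
  .where($"rnb" === 1) // collapse to 1 record per ID
  .withColumn("SGNL", concat_ws(",", $"SGNL")) // convert array to single string
  .drop($"rnb")
  .show(false) // false: do not truncate the long SGNL strings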
This gives:
+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ID |TIME |SGNL |
+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|00001|1574360355|{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360296,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"}|
|00002|1574360355|{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360297,"SV":"0.0"},{"SN":"Acc","ST":1574360298,"SV":"0.0"} |
+-----+----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
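Note that TIME now holds the minimum TIME among the aggregated records. If you want the maximum TIME instead, you need another window function to find the maximum rnb and then filter on that.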
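The suggestion about the maximum TIME could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the answerer's own code: the object MaxTimeSketch, the helper collapseWithMaxTime, the maxRnb column, the local SparkSession, and the shortened sample values with distinct TIMEs are all made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object MaxTimeSketch {
  // Hypothetical helper: collapses the first `limit` rows of each ID (in TIME order)
  // into one row and keeps the row with the largest row number, so that TIME is the
  // latest TIME among the collapsed rows.
  def collapseWithMaxTime(df: DataFrame, limit: Int = 25): DataFrame = {
    val ordered = Window.partitionBy(col("ID")).orderBy(col("TIME"))
    // Explicit frame covering the whole partition, so every row sees all signals
    val wholePartition =
      ordered.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

    df
      .withColumn("rnb", row_number().over(ordered))
      .where(col("rnb") <= limit) // keep the first `limit` rows per ID
      .withColumn("SGNL", concat_ws(",", collect_list(col("SGNL")).over(wholePartition)))
      // second window function: largest remaining row number per ID
      .withColumn("maxRnb", max(col("rnb")).over(Window.partitionBy(col("ID"))))
      .where(col("rnb") === col("maxRnb")) // keep only the last row per ID
      .drop("rnb", "maxRnb")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("max-time-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows with placeholder SGNL values
    val df = Seq(
      ("00001", 1574360355, "a"),
      ("00001", 1574360356, "b"),
      ("00002", 1574360357, "c")
    ).toDF("ID", "TIME", "SGNL")

    collapseWithMaxTime(df).show(false)
    spark.stop()
  }
}
```

Filtering on the last row number rather than on the first is the only change of approach here; the limit of 25 rows and the comma-joined SGNL string follow the code in the answer.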