I am doing a group by in Spark SQL. Some rows contain the same value but different IDs; in that case I want to select only the first row.
Here is my code:
val highvalueresult = highvalue
  .select($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID", $"RSSI_Weight_avg")
  .groupBy("tagShortID", "Timestamp")
  .agg(max($"RSSI_Weight_avg").alias("RSSI_Weight_avg"))
val t2 = averageDF.join(highvalueresult, Seq("tagShortID", "Timestamp", "RSSI_Weight_avg"))
Here is my result:
tag,timestamp,rssi,listner,rootorg,suborg
2,1496745906,0.7,3878,4,3
4,1496745907,0.6,362,4,3
4,1496745907,0.6,718,4,3
4,1496745907,0.6,1901,4,3
In the result above, for timestamp 1496745907 three listeners have the same rssi value. In this case I want to select only the first row.
Answer 0 (score: 8)
You can use the window function support that Spark SQL provides. Assuming your dataframe is:
+---+----------+----+-------+-------+------+
|tag| timestamp|rssi|listner|rootorg|suborg|
+---+----------+----+-------+-------+------+
| 2|1496745906| 0.7| 3878| 4| 3|
| 4|1496745907| 0.6| 362| 4| 3|
| 4|1496745907| 0.6| 718| 4| 3|
| 4|1496745907| 0.6| 1901| 4| 3|
+---+----------+----+-------+-------+------+
Define the window function (you can choose your own partition and ordering columns):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val window = Window.partitionBy("timestamp", "rssi").orderBy("timestamp")
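Note that ordering by timestamp inside a partition that is already keyed on timestamp leaves tied rows in an unspecified order, so which row receives rank 1 is nondeterministic. A minimal sketch of a deterministic variant, assuming the listner column is an acceptable tie-breaker:

// Assumption: listner works as a tie-breaker; any column with
// distinct values per partition gives a reproducible "first" row.
val deterministicWindow = Window
  .partitionBy("timestamp", "rssi")
  .orderBy("listner")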
Apply the window function (res1 here is the input dataframe shown above):
res1.withColumn("rank", row_number().over(window))
+---+----------+----+-------+-------+------+----+
|tag| timestamp|rssi|listner|rootorg|suborg|rank|
+---+----------+----+-------+-------+------+----+
| 4|1496745907| 0.6| 362| 4| 3| 1|
| 4|1496745907| 0.6| 718| 4| 3| 2|
| 4|1496745907| 0.6| 1901| 4| 3| 3|
| 2|1496745906| 0.7| 3878| 4| 3| 1|
+---+----------+----+-------+-------+------+----+
Select the first row from each window (res5 is the ranked dataframe from the previous step):
res5.where($"rank" === 1)
+---+----------+----+-------+-------+------+----+
|tag| timestamp|rssi|listner|rootorg|suborg|rank|
+---+----------+----+-------+-------+------+----+
| 4|1496745907| 0.6| 362| 4| 3| 1|
| 2|1496745906| 0.7| 3878| 4| 3| 1|
+---+----------+----+-------+-------+------+----+
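Putting the steps together, here is a self-contained sketch of the whole pipeline. The SparkSession setup and the DataFrame construction are assumptions added so the snippet runs on its own; the data and column names match the sample above, and the helper rank column is dropped at the end.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().appName("first-row-per-group").getOrCreate()
import spark.implicits._

// Sample data matching the result shown in the question.
val df = Seq(
  (2, 1496745906L, 0.7, 3878, 4, 3),
  (4, 1496745907L, 0.6, 362, 4, 3),
  (4, 1496745907L, 0.6, 718, 4, 3),
  (4, 1496745907L, 0.6, 1901, 4, 3)
).toDF("tag", "timestamp", "rssi", "listner", "rootorg", "suborg")

// Rank rows within each (timestamp, rssi) group, ordering by listner
// so that "first" is deterministic, then keep only the top-ranked row.
val window = Window.partitionBy("timestamp", "rssi").orderBy("listner")
val firstRows = df
  .withColumn("rank", row_number().over(window))
  .where($"rank" === 1)
  .drop("rank")

firstRows.show()

If you only need one arbitrary row per group, df.dropDuplicates("timestamp", "rssi") is an equivalent shortcut, though it gives no control over which row survives.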