Spark Dataset - groupBy.agg(max(column), collect_list(column))

Time: 2019-03-02 17:11:02

Tags: apache-spark apache-spark-sql

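I have a Dataset on which I call groupBy("_1","_2","_3","_4").agg(max("_5").as("time"),collect_list("_6").as("value")). It returns a Dataset grouped on the four columns, with the max of the time column and a collect_list holding every value in that group, e.g. [5,1]. What I want for _6, however, is only the value from the row that matches all the grouping columns and also max("_5").as("time").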

Code:

 val data = Seq(("thing1",1,1,"Temperature",1551501300000L,"5"),("thing1",1,1,"Temperature",1551502200000L,"1"))

 import org.apache.spark.sql.functions._
 val dataSet = spark.sparkContext.parallelize(data)
 import spark.implicits._
 val testDS = dataSet.toDS()
 testDS.groupBy("_1","_2","_3","_4").agg(max("_5").as("time"),collect_list("_6").as("value")).show()

Output:

 |  _1     |  _2  |  _3  |  _4        |  time          |  value  |
 |thing1   |  1   |  1   |Temperature |  1551502200000 | [5,1]   |

Required output:

 |  _1     |  _2  |  _3  |  _4        |  time          |  value  |
 |thing1   |  1   |  1   |Temperature |  1551502200000 | 1       |

I don't want the value 5 in the value column, because it does not fall under the max("_5") criterion. I only need 1 in the value column, since it is the only value that matches all the grouping columns and max("_5").

How can I achieve this?

Thanks.

2 answers:

Answer 0 (score: 2)

You can do this neatly, without a Window function, by using argmax logic:

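// Assumes a SparkSession named `spark` is in scope, as in the question.
import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._

val data = Seq(("thing1",1,1,"Temperature",1551501300000L,"5"),
               ("thing1",1,1,"Temperature",1551502200000L,"1")).toDF

// Take the max of (_5, _6) per group, then expand the struct back into columns.
data.groupBy("_1","_2","_3","_4").agg(
     max(struct("_5", "_6")).as("argmax")).select("_1","_2","_3","_4", "argmax.*").show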

+------+---+---+-----------+-------------+---+
|    _1| _2| _3|         _4|           _5| _6|
+------+---+---+-----------+-------------+---+
|thing1|  1|  1|Temperature|1551502200000|  1|
+------+---+---+-----------+-------------+---+

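When you apply max to a struct in Spark, it returns the struct with the highest first field. If several structs share the same first field, it compares the second field, and so on. Once you have the max struct, you can extract its fields with the struct wildcard *, as in "argmax.*" above.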
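The field-by-field comparison described above behaves much like the ordering of plain Scala tuples. The sketch below is only an analogy on local collections, not Spark code. The third row (a tie on time) and the names ArgmaxSketch and latest are hypothetical additions for illustration.

```scala
// Plain Scala analogue (no Spark): tuples compare field by field,
// which mirrors how max orders structs in the answer above.
object ArgmaxSketch {
  // Same row shape as the question's data.
  val rows: Seq[(String, Int, Int, String, Long, String)] = Seq(
    ("thing1", 1, 1, "Temperature", 1551501300000L, "5"),
    ("thing1", 1, 1, "Temperature", 1551502200000L, "1"),
    ("thing1", 1, 1, "Temperature", 1551502200000L, "7") // hypothetical row: ties on time
  )

  // Group on the first four fields, then keep the largest (time, value) pair per group.
  val latest: Map[(String, Int, Int, String), (Long, String)] =
    rows
      .groupBy(r => (r._1, r._2, r._3, r._4))
      .map { case (key, group) => key -> group.map(r => (r._5, r._6)).max }

  def main(args: Array[String]): Unit =
    latest.foreach { case (key, (time, value)) =>
      println(s"$key -> time=$time, value=$value")
    }
}
```

In this sketch the tie on time is broken by the larger value ("7" sorts after "1" as a string). That is deterministic, but it may not be the row you actually want, so if ties are possible in your data it is worth putting an explicit tie-breaker field into the struct.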

Answer 1 (score: 1)

Use a Window function in this case:

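// Relies on testDS, functions._ and spark.implicits._ from the question's code.
import org.apache.spark.sql.expressions._
// Within each group, order rows by _5 descending so the latest row comes first.
val windowSpec = Window.partitionBy("_1","_2","_3","_4").orderBy(desc("_5"))

// Number the rows per group and keep only the first one.
testDS.withColumn("rowSelector", row_number() over windowSpec)
    .where($"rowSelector" === 1)
    .drop($"rowSelector")
    .show(false)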

Output:

+------+---+---+-----------+-------------+---+
|_1    |_2 |_3 |_4         |_5           |_6 |
+------+---+---+-----------+-------------+---+
|thing1|1  |1  |Temperature|1551502200000|1  |
+------+---+---+-----------+-------------+---+
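If value should stay a list (for example, when several rows share the latest timestamp), one possible variation is to compute the per-group maximum over a window and then aggregate only the rows that match it. This is a sketch and not part of either answer. The object name LatestValuesSketch, the helper latestValues, and the local SparkSession settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, max}

object LatestValuesSketch {
  // Hypothetical helper: keeps only the rows whose _5 equals the group's max _5,
  // then collects their _6 values into a list.
  def latestValues(df: DataFrame): DataFrame = {
    val byGroup = Window.partitionBy("_1", "_2", "_3", "_4")
    df.withColumn("time", max(col("_5")).over(byGroup)) // per-group max, attached to every row
      .where(col("_5") === col("time"))                 // drop rows older than the max
      .groupBy("_1", "_2", "_3", "_4", "time")
      .agg(collect_list("_6").as("value"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would get its session elsewhere.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-values-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("thing1", 1, 1, "Temperature", 1551501300000L, "5"),
      ("thing1", 1, 1, "Temperature", 1551502200000L, "1")
    ).toDF("_1", "_2", "_3", "_4", "_5", "_6")

    latestValues(df).show(false)
    spark.stop()
  }
}
```

Unlike row_number, which keeps exactly one row per group, this variant keeps every value that shares the maximum timestamp, so the value column remains an array even when it holds a single element.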