Question

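I have a Dataset on which I call groupBy("_1","_2","_3","_4").agg(max("_5").as("time"),collect_list("_6").as("value")). It returns a Dataset grouped by the four columns, with the max of the time column and a collect_list holding all the values for that group, e.g. [5,1]. What I want for _6, however, is only the value that matches both the grouping columns and max("_5").as("time"), not just the grouping columns.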

Code below:

 import org.apache.spark.sql.functions._
 import spark.implicits._

 val data = Seq(("thing1",1,1,"Temperature",1551501300000L,"5"),("thing1",1,1,"Temperature",1551502200000L,"1"))
 val dataSet = spark.sparkContext.parallelize(data)
 val testDS = dataSet.toDS()
 testDS.groupBy("_1","_2","_3","_4").agg(max("_5").as("time"),collect_list("_6").as("value")).show()

Output:

 |  _1     |  _2  |  _3  |  _4        |  time          |  value  |
 |thing1   |  1   |  1   |Temperature |  1551502200000 | [5,1]   |

Required output:

 |  _1     |  _2  |  _3  |  _4        |  time          |  value  |
 |thing1   |  1   |  1   |Temperature |  1551502200000 | 1       |

I don't want the value 5 in the value column, because it does not meet the max("_5") criterion. I only need 1 in the value column, because it is the only value that matches all the grouping columns and max("_5").

How can I achieve this?

Thanks.

Answer 1

You can do this neatly, without resorting to a Window function, by using argmax logic:

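import org.apache.spark.sql.functions.{max, struct}
import spark.implicits._  // needed for .toDF on a Seq

val data = Seq(("thing1",1,1,"Temperature",1551501300000L,"5"), 
               ("thing1",1,1,"Temperature",1551502200000L,"1")).toDF

// max over struct(_5, _6) picks the row with the largest _5 and carries its _6 along
data.groupBy("_1","_2","_3","_4").agg(
     max(struct("_5", "_6")).as("argmax")).select("_1","_2","_3","_4", "argmax.*").show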

+------+---+---+-----------+-------------+---+
|    _1| _2| _3|         _4|           _5| _6|
+------+---+---+-----------+-------------+---+
|thing1|  1|  1|Temperature|1551502200000|  1|
+------+---+---+-----------+-------------+---+

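When you use max on a struct in Spark, it returns the struct with the highest first field. If several structs share the same first field, it compares the second field, and so on. Once you have the max struct, you can extract its values with the * wildcard.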
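The field-by-field comparison can be tried without a Spark session. The following is only a plain-Scala analogy, not Spark code: it assumes Scala tuples as a stand-in for the struct, and the names `rows`, `latest` and `tieBreak` are made up for illustration.

```scala
// Plain-Scala analogy (no Spark involved): tuples are ordered field by field,
// which mirrors the struct comparison described above.
val rows = Seq(
  ("thing1", 1, 1, "Temperature", 1551501300000L, "5"),
  ("thing1", 1, 1, "Temperature", 1551502200000L, "1")
)

// Group on the first four fields, then keep the largest (time, value) pair per group.
val latest = rows
  .groupBy { case (a, b, c, d, _, _) => (a, b, c, d) }
  .map { case (key, group) =>
    val (time, value) = group.map { case (_, _, _, _, t, v) => (t, v) }.max
    (key, time, value)
  }
  .toSeq

// When the first field ties, the second field decides.
val tieBreak = Seq((10L, "a"), (10L, "b"), (9L, "z")).max
```

Here `latest` holds one entry per group with the value that belongs to the latest time, and `tieBreak` is `(10L, "b")`. Since `_6` is a String in the sample data, a tie on `_5` would be broken by string comparison, not numeric comparison.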

Answer 2

Use a Window function in this case:

import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("_1","_2","_3","_4").orderBy(desc("_5"))

// Number the rows of each group from the latest _5 downwards and keep only the first one
testDS.withColumn("rowSelector", row_number() over windowSpec)
    .where($"rowSelector" === 1)
    .drop($"rowSelector")
    .show(false)

Output:

+------+---+---+-----------+-------------+---+
|_1    |_2 |_3 |_4         |_5           |_6 |
+------+---+---+-----------+-------------+---+
|thing1|1  |1  |Temperature|1551502200000|1  |
+------+---+---+-----------+-------------+---+

Spark Dataset - groupBy.agg(max(column), collect_list(column))