I want to find the last record for each ID in a typed Dataset. I found a DataFrame-based solution: "Find minimum for a timestamp through Spark groupBy dataframe".
But how do I do the same thing with a typed Dataset?
Something like this:
case class Person(id: Int, name: String, time: Timestamp, kind: String)
val ds: Dataset[Person] = Seq(
(1, "Bob", parseDate("03/08/02 00:00:00"), "P"),
(1, "Bob", parseDate("04/08/02 00:00:00"), "PI"),
(1, "Bob", parseDate("03/08/02 12:00:00"), "PE"))
.toDF("id", "name", "time", "kind").as[Person]
ds.groupByKey(_.id)
.agg(max(_.time), _) // pseudocode for what I want; this does not compile
// .agg(max(struct("time", columnsButTime: _*)) as "all") // => works with a DataFrame
// .select("all.*")
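One possible way to do this while staying in the typed API is `groupByKey` followed by `reduceGroups`, keeping whichever of two records has the later `time`. The following is only a sketch under some assumptions: the names `LatestPerId` and `latestPerId` are made up for illustration, the undefined `parseDate` helper is replaced by `Timestamp.valueOf`, the sample timestamps assume a dd/MM/yy reading of the strings above, and a local `SparkSession` is created just for the demo.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}

case class Person(id: Int, name: String, time: Timestamp, kind: String)

object LatestPerId {
  // Hypothetical helper: keeps, for each id, the record with the greatest `time`.
  def latestPerId(ds: Dataset[Person]): Dataset[Person] = {
    import ds.sparkSession.implicits._ // encoders for Int and Person
    ds.groupByKey(_.id)
      // Pairwise reduction inside each group: keep the later record.
      // If two records share the same time, which one survives is arbitrary.
      .reduceGroups((a, b) => if (a.time.after(b.time)) a else b)
      // reduceGroups yields (key, value) pairs; keep only the value.
      .map(_._2)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration purposes only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("latest-per-id")
      .getOrCreate()
    import spark.implicits._

    // Assumption: "03/08/02" is read as 3 August 2002 (dd/MM/yy).
    val ds: Dataset[Person] = Seq(
      Person(1, "Bob", Timestamp.valueOf("2002-08-03 00:00:00"), "P"),
      Person(1, "Bob", Timestamp.valueOf("2002-08-04 00:00:00"), "PI"),
      Person(1, "Bob", Timestamp.valueOf("2002-08-03 12:00:00"), "PE")
    ).toDS()

    // The surviving record for id 1 is the one with kind "PI".
    latestPerId(ds).show(false)

    spark.stop()
  }
}
```

Because `reduceGroups` works on whole `Person` objects, the result keeps every field without listing columns by hand. The DataFrame-style `max(struct(...))` approach from the commented lines can probably also be converted back with `.as[Person]` after `select("all.*")`, if the untyped aggregation is preferred.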